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Ford; “What do you get if you multiply six ... by nine — by nine? Is that it?” 

Arthur: “That’s it. Six by nine: forty-two! I always said that there is something fundamentally 
wrong about the universe.” 

The Hitchhiker’s Guide to the Galaxy radio series, episode 6 

“Warum hat er es denn so eilig?” 

N. N. about Arnold Schonhage and his fast multiplication 



Acknowledgment s 


First and foremost, I wish to express my sincerest gratitude to my advisor Prof. Dr. Michael 
Clausen for his help, understanding, advice and encouragement. It was a pleasure to work under 
his guidance. 

Furthermore, I wish to thank Karsten Kopnick and Nicolai Dubben for their proofreading and 
fruitful discussions. Their comments and questions were greatly appreciated. 

I thank Prof. Dr. Arnold Schonhage for inspiring conversations many years ago when I first 
started to get interested in long numbers and also for his friendly comments on this thesis. 

I am grateful to the authors of DTgX for their excellent typesetting system and the authors 
of PGF/TikZ and PGFPlots for their packages to produce beautiful graphics. Furthermore, 
I thank the many contributors on tex.stackexchange.com for the fast and excellent help in 
times of need. 

I thank Martin Winkler and our mutual business for understanding and support of my studies, 
especially when time got tight. 

Many others have supported me, I just want to mention Dirk Eisenack, Heidi Forster, Goran 
Rasched and Susanne Rohrig. I am happy and grateful for their interest in, suggestions for and 
support of my work. 

Also, I thank Anne-Sophie Matheron for asking me one too many times why I didn’t finish my 
diploma. 

Lastly, I thank my family and especially my mother for her boundless faith in me. 


hi 


Contents 


Acknowledgments iii 

List of Figures vi 

Symbols & Notation vii 

1 Introduction 1 

2 Overview of Established Algorithms 4 

2.1 Representation of Numbers. 4 

2.2 Memory Management . 5 

2.3 Ordinary Multiplication. 6 

2.4 Karatsuba Multiplication . 8 

2.5 Toom-Cook Multiplication. 11 

2.6 The Fast Fourier Transform. 12 

2.7 FFT-based Polynomial Multiplication . 16 

2.8 Modular FFT-based Multiplication. 17 

2.9 Modular Schonhage-Strassen Multiplication. 22 

2.9.1 Invertibility of the Transform. 22 

2.9.2 Convolutions. 24 

2.9.3 The Procedure. 27 

2.9.4 Run-time Analysis. 29 

2.9.5 Benchmarking . 30 

2.9.6 Outlook. 33 

3 The DKSS Algorithm 34 

3.1 Overview . 34 

3.2 Formal Description. 34 

3.2.1 Choosing M and m . 35 

3.2.2 Finding the Prime p . 36 

3.2.3 Computing the Root of Unity p . 36 

3.2.4 Distribution of Input Bits. 38 

3.2.5 Performing the FFT. 38 

3.2.6 Componentwise Multiplication. 42 

3.2.7 Backwards FFT . 42 

3.2.8 Carry Propagation. 42 

3.3 Run-time Analysis. 43 

3.3.1 Analysis of each Step . 43 


IV 





























Contents 


V 


3.3.2 Putting Everything Together. 45 

3.3.3 Resolving the Recursion. 46 

3.4 Differences to DKSS Paper . 49 

4 Implementation of DKSS Multiplication 51 

4.1 Parameter Selection . 51 

4.2 A Look at the Code. 52 

4.3 Asserting the Code’s Correctness. 54 

4.4 Execution Time. 55 

4.5 Memory Requirements. 56 

4.6 Source Code Size. 58 

4.7 Profiling. 59 

4.8 Gazing into the Crystal Ball. 61 

5 Conclusion 64 

5.1 Outlook. 65 

A Technicalities 67 

Bibliography 68 















List of Figures 


1 Execution time of OMUL . 7 

2 Execution time of KMUL . 10 

3 Execution time of KMUL (close-up) . 11 

4 Execution time of T3MUL . 12 

5 Splitting an array into even and odd positions. 14 

6 Halving the already shuffled array . 15 

7 Execution time of QMUL . 21 

8 Convolution of two polynomials. 25 

9 Cyclic convolution of two polynomials. 26 

10 Negacyclic convolution of two polynomials. 27 

11 SMUL FET length vs. input length. 31 

12 Execution time of SMUL . 32 

13 SMUL run-time constant a . 33 

14 Encoding an input integer as a polynomial over TZ . 39 

15 Input vector a . 39 

16 Input vector a written as column vectors of 2m elements . 39 

17 Result of inner DFTs as 2m row vectors of n elements. 41 

18 Outer DFTs on 2m row vectors of // elements. 42 

19 Execution time of DKSS_MUL . 56 

20 Memory requirements of DKSS_MUL and SMUL . 57 

21 Profiling percentages for DKSS_MUL. 59 

22 Profiling percentages for dkss_fft 0. 60 

23 Profiling percentages for bad multiplications. 60 

24 Execution times of DKSS_MUL and SMUL . 61 

25 Quotient of DKSS_MUL and SMUL run-times vs. input length. 62 

26 DKSS MUL constant 6 . 63 


VI 




























Symbols Sz Notation 


M 

C 

Z 

N 

No 

[a : b] 

a/bc ■ d 

a I b 

a\b 

{a,b) 

[a\ 

\a\ 

logx 

log“ X 

exp^(x) 
log* X 

a has degree-bound n 
g{n) E 0{f{n)) 


field of real numbers 

field of complex numbers 

ring of all integers: 0, ±1, ±2,... 

{x € Z I X > 0} 

{x E Z I X > 0} 

{x E Z I a < X < 6}, for a, 6 E Z 
(a/(6c)) • d 
a E Z divides 6 E Z 
a E Z does not divide 6 E Z 
greatest common divisor of a, 6 E Z 
max{x E Z I x < a}, floor of a E M 
min{x E Z I X > a}, ceiling of a E M 
logarithm of x to base 2 
(log(x))“ 

n-times iterated exponentiation of x to base b, cf. page 48 

iterated logarithm of x to base 2, cf. page 48 

deg(a) < n, a a polynomial, n E N 

3c > 0, no E No : Vn > no : \g{n)\ < c ■ f{n) 


vii 


Chapter 1 


Introduction 


Multiplication of integers is one of the most basic arithmetic operations. Yet, if numbers get 
larger, the time needed to multiply two numbers increases as well. The naive method to multiply 
requires c-N‘^ bit-operations to multiply numbers with N digits, where c is some constant.^ For 
large numbers this process soon becomes too slow and faster means are desirable. 

Fortunately, in the 1960s methods were discovered that lowered the number of operations suc¬ 
cessively until in 1971 Schonhage and Strassen [SS71] found a technique that only requires 
0{N ■ log Y • log log Y) bit-operations.^ This algorithm was the asymptotically fastest known 
method to multiply until in 2007 Fiirer [Fu07] found an even faster way. Asymptotically means 
that the algorithm was the fastest, provided numbers are long enough. Elaborate algorithms 
often involve some costs for set-up that only pay off if the inputs are long enough.§ 

Fiirer’s algorithm inspired De, Kurur, Saha and Saptharishi to their multiplication method 
[DKSS08], published in 2008, and a follow-up paper [DKSS13], the latter being discussed in this 
thesis and which I call DKSS multiplication. Both Fiirer’s and DKSS’ new algorithms require 
Y • log Y • bit-operations, where log* Y (pronounced “log star”) is the number of times 

the logarithm function has to be applied to get a value < 1. 

However, Fiirer conjectured that his new method only becomes faster than Schonhage and 
Strassen’s algorithm for “astronomically large numbers” [Fu09, sec. 8]. Feeling unhappy about 
this vague assessment, I implemented the DKSS algorithm and compared it to Schonhage and 
Strassen’s method to see if or when any improvement in speed could be achieved in practice. 
Both algorithms use only integer operations, in contrast to Fiirer’s algorithm that uses floating 
point operations. 


The ability to multiply numbers with millions or billions of digits is not only academically 
interesting, but bears much practical relevance. For example, number theoretical tasks like 
primality tests require fast multiplication of potentially very large numbers. Such calculations 
can be performed nowadays with computer algebra systems like Magma, Maple, Mathematica, 
MATLAP, or Sage. Calculation of vr or e to billions of digits or computing billions of roots of 

^Usually, the constant is omitted and instead of c • we write O(Y^). The constant c is hidden in the 

0 (...). 

■^The logarithm function to base 10, logj^Q N, is approximately the number of decimal digits of N. So if N is 
multiplied by 10, the logarithm just increases by 1. This is to show how slowly it is growing. 

®Think of Hnding names in a stack of business cards: if you sort the cards first, you can find a name quickly, 
but it is only worth the effort if you search for a certain number of names. 
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Riemann’s zeta function are other fields that requires fast large number multiplication [GG13, 
sec. 8.0]. 

Also, fast multiplication is an important building block of a general library for arithmetical 
operations on long numbers, like the GNU Multiple Precision Arithmetic Library [GMP14]. 
Addition and subtraction are not hard to implement and many of the more complex tasks — 
like inversion, division, square root, greatest common divisor — revert back to multiplication, 
cf. [GKZ07]. Once these operations are implemented for integers, they can be used to provide 
arbitrary-precision arithmetic for floating point numbers that attenuate rounding problems, cf. 
[GLTZlOj. 

Another big application for multiplication of long numbers is polynomial multiplication with 
integer coefficients, since it can be reduced to one huge integer multiplication through Kronecker- 
Schonhage substitution [Sch82, sec. 2]. If (multivariate) polynomials are of high degree, the 
resulting integers can become very long and fast means for multiplication are essential. Factoring 
of polynomials is also an important field of activity, see [GKZ07]. 


All elaborate multiplication methods use some sort of fast Fourier transform (FFT) at their 
core. The main idea behind all FFT multiplication methods is to break a long number into 
smaller pieces and interpret those pieces as coefficients of a polynomial. Since a polynomial of 
degree less than 2n is uniquely determined by its sample values for 2n pairwise different sample 
points, two polynomials of degree less than n can be multiplied like this:^ 

1. Evaluate both polynomials at the same 2n sample points, 

2. multiply the sample values pairwise, and 

3. interpolate the polynomial from the sample value products. 


The FFT is “fast”, since it computes n sample values with only 0{n ■ logn) operations, which 
is an enormous advance from the naive approach and its O(n^) operations. This method was 
already known by Gauss in 1805 [HJB85], but rediscovered by Gooley and Tukey in 1965 [CT65] 
and then revolutionized computation. 

The method by Schonhage and Strassen breaks numbers of N bits into pieces of length 0{y/N) 
bits. Furthermore, it is cleverly designed to take advantage of the binary nature of today’s 
computers: multiplications by 2 and its powers are particularly simple and fast to perform. This 
is why it has not only held the crown of the asymptotically fastest multiplication algorithm for 
over 35 years, but is also in widespread practical use today. 

The new DKSS multiplication has a better asymptotic time bound, but its structure is more 
complicated. This elaborated structure allows input numbers to be broken into pieces only 
0((logAi)^) bits small. However, the arithmetic operations are more costly. The purpose of 
this thesis is to see if or when DKSS multiplication becomes faster than Schonhage-Strassen 
multiplication in practical applications. 


Chapter 2 of this thesis presents an overview of multiplication algorithms from the naive method 
to techniques that provide a good trade-off if numbers are of medium length (like Karatsuba’s 

^Since the resulting polynomial is the product of its two factors, it has degree 2n — 2. Therefore, at least 
2n — 1 sample points are needed to recover the result. 
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method in Section 2.4). The fast Fourier transform is introduced in Section 2.6 and is followed 
by a detailed description of Schonhage and Strassen’s procedure in Section 2.9. All methods 
were implemented and their run-times are determined theoretically, measured in practice and 
illustrated graphically. Schonhage and Strassen’s algorithm is more thoroughly analyzed in 
respect of its run-time, memory consumption and possible areas for improvement. 

In Chapter 3 the DKSS algorithm is explained in detail and its run-time is analyzed theoretically. 
Section 3.4 describes the differences between my implementation and the paper [DKSS13]. 

Chapter 4 presents details of the implementation and illustrates its run-time (Section 4.4), 
memory requirements (Section 4.5) and source code complexity (Section 4.6) in comparison to 
Schonhage and Strassen’s method both in numbers and graphics. Section 4.8 estimates the 
crossover point at which both algorithms become equally fast. 

Lastly, Chapter 5 sums up the results and shows possible areas for improvement together with 
an assessment of their potential. 


In this version of my thesis the typesetting has been modified to produce a more concise layout 
and some minor errors have been corrected. 



Chapter 2 


Overview 


of Established Algorithms 


This chapter covers the well established algorithms to multiply large numbers, starting with the 
naive method. Methods for medium-sized numbers are discussed, the fast Fourier transform 
is introduced and Schonhage-Strassen multiplication is presented in detail. But first, some 
basic remarks about storage of large numbers and memory allocation for temporary storage are 
necessary. 

2.1 Representation of Numbers 


I assume my implementation is running on a binary computer and the machine has a native 
word size of w bits, so it can hold nonnegative integer values 0... 2*^ — 1 in its general purpose 
registers. We call this unit a computer word. Today, the most common word sizes are 32 and 
64 bits, therefore a machine register can hold integers between 0 and 2^^ — 1 = 4 294 967 295 or 
2^^^ - 1 = 18 446 744073 709 551615, respectively. 

If we want to do calculations with numbers that exceed the aforementioned range, we must use 
some multiple precision representation for them. If we call W := 2“^ the wordbase, we can write 
any nonnegative integer a < as a = Oikk*, with a* € [0 : W — 1]. We can view this as 

representing a with n words or “digits” in base W. 

In my implementation a nonnegative number a < is represented by an array of n words 
as a = (oo, oi,..., a„,- 2 , On-i)- The words are ordered with increasing indices in main memory. 
This ordering is called little-endian. It was a design choice to use this ordering: cache prefetching 
used to work better in forward loops, which are often used due to carry propagation. Modern 
CPUs seem to have remedied this problem. 

The same ordering is used by Schonhage et al. [SGV94, p. 7] as well as in GMP, The GNU 
Multiple Preeision Arithmetie Library [GMP14, sec. 16.1] and MPIR, a prominent GMP fork 
[MPIR12]. Interestingly, Zuras [Zur94] describes that storing numbers as big-endian worked 
better on his compiler and architecture. 

The i-eode in [SGV94, p. 6] stores the length n of the number after the most significant word 
in main memory. In contrast, my implementation keeps the length as a separate integer and 
provides both pointer to array and length as arguments on function calls. 
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Please note that we can usually pad any number with zero words on the upper end without 
influencing the result of operations (except for possible zero-padding of the result). It is a 
small waste of memory and processing time, but can simplify implementation of algorithms, for 
example, if an algorithm expects the length of a number to be even. 

Negative numbers are represented as the two’s complement of their absolute value. I followed 
the example of the i-code from [SGV94] in this design decision. It seems like a sensible choice, 
since execution time of simple operations like addition and subtraction benefit from this repre¬ 
sentation, whereas more elaborate operations like multiplication can afford the slight increase 
in execution time if negative numbers are being handled. 

If negative numbers are handled and padding takes place, they have to be padded with all 
binary ones, that is, words with binary value —1. The most significant bit acts as sign bit if a 
number is interpreted as a signed value. 


2.2 Memory Management 


For all but the most basic functions we will need some temporary memory. To make routines 
fast, it is important that storage can be allocated and freed quickly. This forbids the use of 
the regular C-style mallocO or C++ new (which is just a wrapper for the former). C-style 
mallocO is designed to allow memory of different sizes to be allocated and freed at random 
and still maintain low fragmentation; many implementations are even thread-safe. 

Since lifetime of temporary storage in our algorithms ends when a called function returns, we 
can use a stack-like model for temporary memory, which greatly simplifies the design of the 
allocator, makes it fast and doesn’t need any locking. Plus, it has the added benefit of good 
cache locality. This is known as region-based memory management. In my code, this allocator 
is called tape_alloc. 

To keep allocated memory continuous, every time memory is needed the allocator allocates 
more than is requested and records the total amount of memory allocated. When afterwards 
all memory is freed and later on a new allocation request is made, the allocator will right 
away allocate the total amount of memory used last time. The idea is that since algorithms 
often involve multiple calculations that handle long numbers in the same size-range, upcoming 
memory requirements will be as they were in the past. 

Schonhage et al. implemented their algorithms on a hypothetical Turing machine called TP 
with six variable-length tapes and a special assembly language-like instruction set called TPAL 
(see [SGV94] and [Sch]). Of course, this machine has to be emulated on a real computer, so 
TPAL instructions are translated to C or assembly language for the target machine. Thus the 
tape-like structure of memory is retained. 

The GMP library allocates temporary memory on the stack with allocaO. This should be 
fast and thread-safe, since no locking is required. 
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2.3 Ordinary Multiplication 


All of us have learned to multiply with pencil and paper in school. This is often referred to as 
ordinary multiplication or grade school multiplication. The implementation of it is called OMUL 
(this name and others are inspired by [SGV94]). 

Suppose we want to multiply two nonnegative integers a and b with lengths of n and m words, 
respectively, to compute the product c := ab with length n + m. To do that we have to multiply 
each Oi, i € [0 : n — 1] with each bj, j & [0 : m — 1] and add the product to Cj+j, which has to 
be set to zero before we start. Plus, there has to be some carry propagation. 

In Python 3.x, our OMUL algorithm looks like this.^ I have left out the carry propagation here, 
since this example only serves to show the principle. The C++ example will be more specific. 


def omul(a, b) : 

c = [0] * (len(a) + len(b)) 

for i in range(0, len(a)): 
for j in range(0, len(b)) 
c[i + j] += a[i] * b[j] 
return c 


# initialize result with zeros 

# cycle over a 

# cycle over h 

# elementary mul and add 


This Python implementation hides an important implementation detail: If a multiple precision 
number is made up of words and these are the same size as a processor register, then the product 
of two such words will be twice the size of a processor register! Our code must be able to handle 
this double-sized result. This is not a problem in the Python code above, since Python’s int 
type is multiple precision by itself. A similar function in C++ shows more of that detail: 


void omul(word* c, word* a, unsigned alen, word* b, unsigned blen) { 

memset(c, 0, (alen+blen) * sizeof(word)); // set result to zero 

for (unsigned i=0; i<alen; ++i) { // loop over a[i]^s 

word carry = 0; // for overflow 

unsigned j = 0; 

while (j < blen) { // loop over b[j]^s 

carry = muladdc(c[i+j], a[i], b[j], carry); 

} 

c [i +j] = carry; 

} 

} 


The type word is a placeholder for an unsigned integer type with the size of a processor word or 
smaller. The interesting part happens in the function muladdc (): a[i] andb[j] get multiplied, 
the input carry carry and the already computed result c[i+j] are added to the product, the 
lower part of the result is written back to memory (into c[i+j]) and the higher part of the 
result is saved in carry to be handled in the next iteration. 

We have not yet addressed the problem of the double-sized multiplication result. We have two 
choices here: either use a word type that is only half the processor word size, so the product 
can be stored in a full processor word, or use some special function to get both the high and 
low part of a full sized multiplication in two separate variables. Luckily, modern compilers offer 
an intrinsic function for that and compile good code from it. The other option is still available, 
but takes about 60 % more time here for inputs of the same bit-size.^ 

^The coding style is very un-pythonic and should only serve for explanation. 

^All timings are expressed in processor cycles and were done on an Intel Core 17-3770 CPU in 64-bit mode 
running Windows 7. Appendix A describes the test setup in detail. 
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For a 64-bit word type in Microsoft Visual C++, the muladdcO function looks like this: 


typedef unsigned __int64 uint64; // keep it short 

uint64 muladdc(uint64& mem, uint64 a, uint64 b, uint64 carry_in) { 
uint64 hiprod; 

uint64 lowprod = _umull28(a, b, &:hiprod); // intrinsic function 

hiprod += addcCmem, lowprod, carry_in); 

return hiprod; // carry out 

} 

uint64 addc(uint64& mem, uint64 v, uint64 carry_in) { 
uint64 rl = mem + v; 

uint64 carry_out = rl < v; // overflow? 

uint64 r2 = rl + carry_in; 

carry_out += r2 < carry_in; // overflow? 

mem = r2; 

return carry_out; 

} 


Again, we have to wrestle with word size limitation when handling overflow from addition in 
addc(). Unfortunately, Microsoft’s C++ compiler doesn’t offer a way to read the processor’s 
carry flag. So, we have to do an additional comparison of the result with one of the inputs to 
determine overflow [War02, p. 29]. The resulting code is surprisingly fast, despite the superfluous 
comparison. 

The total run-time of OMUL is easily determined: we have to do nm word-to-doubleword mul¬ 
tiplications, since each Oj has to be multiplied by each bj. The number of additions depends 
on the implementation: the Python version has nm additions, but they are at least triple-size, 
since the carries accumulate. The C++ version has four word-sized additions per multiplication. 

In either case, the total run-time is 0{nm), and assuming m = n it is O(n^). 



Figure 1: Execution time of OMUL 
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This is the “classical” time bound and even in 1956 it was still conjectured to be optimal, since 
no one had found a faster way to multiply for more than four millennia [Kar95]. 

Figure 1 shows a double-logarithmic graphical display of execution times in processor cycles for 
different input sizes. Observe the slight bend at the beginning, which shows the constant costs 
of calls and loop setup. Apart from that the graph is very straight, which shows that caching 
has no big influence, even though the highest input sizes well exceed the level 1 and level 2 
cache sizes on the test machine. 


2.4 Karatsuba Multiplication 


Let a,b< be two nonnegative integers, that is, both numbers consist of maximum 2n words. 
We are looking for a faster way to multiply both numbers to get their product c = ab < 

We can “cut” both numbers in half, that is, express them as 

a = ao + aiW'^ and b = bo + biW^, (2.1) 

with ao, oi, bo, bi < W^. The classical approach to calculate the full product from its four 
half-sized inputs is 


o6 = (ao + aiW")(6o + 6ilT") 

= aobo + (aobi + aibo)W^ + ( 2 . 2 ) 

This way, we can break down a single 2n-word multiplication into four n-word multiplications. 
Unfortunately, we don’t gain any speed by this. 


In 1960 Karatsuba found a faster way to multiply long numbers [K063]. The following slightly 
improved version is due to Knuth [Knu97b, p. 295]. The implementation of it is called KMUL. 

First, we compute the following three n-word multiplications 


Pq = oo^o 

-Pi = (ao - ai)(bo - bi) 

P 2 = aibi 

and use these three “small” products to recover the full product with only some extra additions 
and subtractions plus shifts (multiplications by powers of W): 

ab = Po(l + W^) - PiW^ + P 2 iW^ + W^'^) (2.3) 

= aobo(l + W^) - (ao - ai)(bo - bi)W^ + aMW^ + 

= aobo + aoboW^ - aoboW^ + ao6iIU’" + aiboW^ - aibiW^ + 
aibiW^ + aibiW^^ 

= aobo + (aobi + aibo)W^+ aibiW‘^^. □ 

It looks like more work, but it is a real improvement. Since ordinary multiplication runs in 
O(n^), saving multiplications at the cost of additions, subtractions and shifts, which can be 
done in linear time, is a good deal in itself. But if we use Karatsuba’s algorithm recursively, we 
can even achieve a time bound of 0(n^°®^) ~ 0(v}'^^^). 
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We are going to prove this bound by induction. Denote T(n) the time it takes to multiply 
two n-word numbers. We know that we can reduce a 2n-word multiplication to three n-word 
multiplications and some operations with linear run-time. Furthermore, we have to assign some 
cost to T(l). So 


r(i) = c, 

T(2n) = 3T(n) -|- 2cn. 


We are going to show that 

T(n) = 3cn*°®^ — 2cn. 

This proof originates from [AHU74, p. 63]. It is easy to check the induction basis: r(l) = 
3c • — 2c • 1 = c. Next, we have to check the induction step; 

T{2n) = 3T{n) + 2cn 

= 3(3cn^°®^ — 2cn) + 2cn 
= 3c(3n^°®^) — 6 cn + 2 cn 
= 3c(3n^°§^) - 2 c( 2 n) 

= 3c(2'°sV°s3)-2c(2n) 

= 3c(2n)'°§3-2c(2n). □ 

If we implement this procedure, we first compute the three products Pq, Pi, P 2 and then use 
(2.3) to shift and add up the small products to get the full product. That means, we need some 
temporary storage for the small products and for the two factors that make up Pi. 

To compute the two factors (oq — ui) and (60 — bi) we would like to avoid working with negative 
numbers, to keep things simple. To that end I use a knack (borrowed from GMP) and compare 
the minuend and subtrahend of the subtraction, always subtract the smaller from the larger 
and keep the sign bit in an extra variable. The implementation accommodates for the sign bit 
later when it re-assembles the three sub-products. 

The mentioned ideas look like this when coded in C++: 


void kmulCword* r, word* a, unsigned alen, 
if (alen < blen) { 
swap(a , b) , 
swapCalen, blen); 

} 

if (blen < kinul_thresh) { 

omul(r, a, alen, b, blen); 
return ; 

} 

unsigned lien = blen / 2; 
unsigned ahlen = alen - lien; 
unsigned bhlen = blen - lien; 


word* b, unsigned blen) { 

// b must not be longer than a 
// swap pointers 

// inputs too short? 

// use omul 

// low part length 
// a high part length 
// b high part length 


// compute rO = aO * bO: this will lie in ’r^ on index 0.. llen-1 
kinul(r, a, lien, b, lien); 

// compute r2 = al * bl: this will lie in •’r-’ on index 2*llen. . alen + blen-1 
kmul(r + 2*lien , a+llen, ahlen, b + llen , bhlen); 


// allocate temporary space for differences and third mul 

tape_alloc tinp(4*ahlen + 1); 

word* sa = tmp.p; 

word* sb = tmp.p + ahlen; 

word* ps = tmp.p + 2*ahlen; 

// subtract values for later multiplication 
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bool asign = coinpu_nn (a+llen , ahlen 

, a, lien) 

< 0; 

// 

asign set 

-i-f 

al 

< aO 

if (asign) subu(sa, ahlen, a, lien. 

a+llen , ahlen); 

// 

aO - al > 

0 



else subu(sa, ahlen, a+llen, ahlen. 

a, lien) ; 


// 

al - aO >■ 

= 0 



bool bsign = conipu_nn (b + llen , bhlen 

, b, lien) 

< 0; 

// 

bsign set 

if 

bl 

< bO 

if (bsign) subu(sb, ahlen, b, lien. 

b + llen , bhlen); 

// 

bO - bl > 

0 



else subu(sb, ahlen, b+llen, bhlen. 

b, lien); 


// 

bl - bO > 

= 0 



// multiply both absolute differences 







unsigned plen = 2*ahlen + 1; 



// 

there can 

b e 

a 

carry 


kmulCps, sa, ahlen, sb, ahlen); 
ps[plen-1] = 0 ; 

// compute middle result 

if (asign == bsign) subu_on_neg(ps, plen, r, 2*llen); // ps = rO - ps 
else addu_on(ps, plen, r, 2*llen); // ps += rO 

addu_on(ps, plen, r + 2*llen, ahlen + bhlen); // ps += r2 

// add the final temp into the result 
addu_on(r + llen , ahlen + blen , ps , plen); 


The code checks if input sizes suggest OMUL will be faster and if so, calls it instead. This is 
because KMUL is asymptotically faster than OMUL, but not so for small input lengths. Obviously, 
KMUL is more complicated than OMUL, as it uses several calls to add and subtract, conditional 
branches and temporary memory. All this takes its time compared to a very streamlined double¬ 
loop structure of OMUL that modern processors are really good at executing. 

To achieve maximum performance we have to find the input length where KMUL starts to be 
faster than OMUL. This is called the crossover point or threshold value. The crossover point 
depends on the processor architecture, memory speed and efficiency of the implementation. To 
find the crossover point we have to benchmark both algorithms against one another at various 
input lengths. 



Figure 2: Execution time of KMUL 
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Figure 2 shows the timings of OMUL and KMUL. We can see that KMUL is faster the longer the 
inputs are (with an input length of 10 000 words KMUL is about nine times faster than OMUL), 
but in the low ranges it is slower than OMUL. 

To have a better look at the crossover point, Figure 3 has linear scaling and shows only input 
sizes up to 60 words. From the graph we can see the crossover point is at about 24 words input 
length, that is, about 460 decimal digits. 


2.5 Toom-Cook Multiplication 


Let us take a broader look at Karatsuba’s algorithm: it follows from the fundamental theorem of 
algebra that any polynomial a{x) of degree < k is determined by its values at k distinct points. 
In the case of Karatsuba’s algorithm, if we substitute LF” in (2.1) with the indeterminate x we 
get input polynomials of degree one: a{x) = oq + aix and b{x) = 6o + bix. If we multiply both, 
the result c{x) := a{x)b{x) is a polynomial of degree two. 

What we did in Karatsuba multiplication can be understood as evaluating both polynomials 
a{x) and b{x) at points {0, — 1, oo}.^’^ Then we multiplied the results pointwise and interpolated 
to regain the polynomial c{x). To regain the integer result we evaluated c{x) at x = VF”. 

We can generalize this technique: evaluate polynomials of degree < k at 2k — 1 distinct points, 
multiply pointwise and interpolate. The time bound of this scheme is so for 

A: = 3, 4, 5 it is approximately 0(71^'“^®^), and 0(n^'^®®), respectively. This method is 

due to Toom [Too63] and Cook [Coo66]. 

^By abuse of notation a(oo) means lim^^-oo a{x)/x and gives the highest coefficient. 

Other distinct points of evaluation would have done as well. For example, Karatsuba’s original paper used 
{0, l,oo}. 
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Figure 4: Execution time of T3MUL 


The points for evaluation can be freely chosen (as long as they are distinct), but it is not obvious 
which choice leads to the simplest formulas for evaluation and interpolation. Zuras [Zur94] and 
Bodrato [BZ06] offer good solutions. 

I implemented the Toom-Cook 3-way method from [BZ06] and called it T3MUL. Figure 4 shows 
a graph of execution time vs. input length. The crossover point of T3MUL and KMUL is at about 
100 words or 2000 decimal digits. 

Unfortunately, the exponent in the time bound drops slowly as k increases and the number of 
linear operations (additions, subtractions and divisions by constants) rises quickly with k. This 
leads to ever higher crossover points. Furthermore, each new fc-way method has to be set in 
code individually. This calls for a more general solution. 


2.6 The Fast Fourier Transform 


We are going to have a look at the fast Fourier transform (or EFT) which was (re-)discovered 
in 1965 by Cooley and Tukey [CT65] . By choosing to evaluate the polynomial at certain special 
points it allows us to do the evaluation very quickly. 

This is just a short description of the fast Fourier transform as far as it concerns us now. A 
good introduction can be found in Cormen et al. [CLRS09, ch. 30], Clausen and Baum [CB93] 
cover the topic from a group-theoretic standpoint and Duhamel and Vetterli [DV90] present a 
good overview of the plethora of different FFT algorithms. 
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Let ii be a commutative ring with unity and let n be a power of 2.1 The number n is called 
the EFT length. Let Wn be a primitive n-th root of unity in R, that is, ca” = 1 and / 1 for 
k £ [1 : n — 1], We simply write u instead of Un, if the value of n is clear from the context. 
Furthermore, let a{x) = ^ polynomial over R with degree-bound n.l 

For example, M contains only a primitive 2nd root of unity, namely —1, but no higher orders. 
But C does: ujn = is a primitive n-th root of unity in C. 

Another example is the quotient ring Z/nZ: it can be identified with the integers from 0 to 
n — 1, where all operations are executed modulo n. Z/nZ can contain up to n — 1 roots of unity. 
In the case of n = 17, cu = 3 is a primitive 16**^ root of unity. 

We want to evaluate a{x) at n distinct, but otherwise arbitrary points. If we choose to evaluate 
a{x) at /c G [0 : n — 1], we can design the evaluation particularly efficient. Because cu is a 
primitive n-th root of unity, we know ..., to be pairwise different. 

We can re-sort a(x) in even and odd powers and rewrite it as 

a{x) = oo -b aix + a 2 X^ + a^x^ + ■ ■ ■ 

= ao -b a2X^ -b ... -b aix -b osx^ -b ... 

= ao -b a 2 X^ -b ... -b x(ai -b a^x^ -b ...) 

=:e(a;2) =:o(x^) 

= e(x^) -b xo(x^), 

where e(x) and o(x) are polynomials with half the degree-bound as a(x). Since n is a power 
of 2, we can proceed recursively with this divide-and-conquer approach until the degree-bound 
of both polynomials e(x) and o(x) is one, that is, both consist only of a constant. 

We can evaluate a(x) at /c G [0 ; n/2 — 1] and get 

a(a;^) = e(a;2^)+a;^o(w2fc)^ 

But note what we get if we evaluate a(x) at k £ [0 : n/2 — 1]: 

Q(^fc+n/2) = ^^fc+n/2^((^fc+n/2)2) 

= e(a;2^+”) + 

= e(a;2^) - 


since = — 1 and w"' = 1. 

If we have already computed eiuP'^) and oiuP'^) we save time by calculating both aioj^) and 
^(^fc+n/ 2 ) side by side: 


a(a;^) = e(a;2^)+w^o(w2fc) 

a{^^k+n/2) ^ 

This is the concept that makes the fast Fourier transform efficient. After solving both halves of 
the problem we can calculate two results in 0(1) additional time.^ 

^Please note that n no longer designates an input length in words. 

*Any integer n > deg(a) is called a degree-hound of a. 

®The simultaneous calculation of sum and difference is called a butterfly operation and the factors in front 
of o{uP^) are often called twiddle factors. 
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There are different types of FFTs and the one just described is called a Cooley-Tukey EFT of 
length n. More precisely, it is a radix-2 decimation in time FFT. See [DV90] for other types of 


FFTs. 


We can write this algorithm as a recursive function in Python. The computation of the actual 
root of unity has been left out of this example to keep it independent of the ring R. 


def fft(a): 
n = len(a) 
if n <= 1: return a 
even = fft(a[0::2]) 
odd = f f t (a [1 : : 2] ) 
r = [0] * n 

for k in range(0, n//2): 
w = root_of_unity(n , k) 
r [k] = even [k] + w * odd [k] 

r[k + n//2] = even [k] - w * odd [k] 

return r 


# a is a list 

# degree-bound of a 

# slice up even numbered values 

# ... and odd 

# fill list with n dummies 

# n//2 means integer divide 

# n-th root to k-th power 


Since at each recursion level the input list is split into values with even and odd indices, we get 
the structure shown in Figure 5, if we assume a start with eight input values. 

Notice the ordering of the indices at the lowest level: the values are at an index which is the 
bit-reversed input index. “Bit-reversed” here means only reversing the bits that are actually 
used in indexing: in the last example we had eight values, hence we needed 3 bits for indexing. 
Accordingly, the bit-reversed index of, for example, 03 = aon^ is 110;, = 6 . 

The bit-reversal is a consequence of the splitting of the array into even and odd indices. Since 
even indices have the lowest bit set to zero, all “left” members of the output array have the 
highest bit of the index set to zero, whereas all “right” members have odd indices and have the 
highest bit set to one. This repeats itself through all levels. 

We use this observation to decrease the memory footprint: the fft () function listed above uses 
temporary memory at each level, first to split up the input in even and odd indexed values and 
then to create the list of return values. We would like to save those allocations. Luckily, that is 
possible. The following design and the term “shuffle” is taken from Sedgewick [Sed92, ch. 41]. 

If we reorder the input list according to its bit-reversed indices, all even indexed values are in 
the first half and all odd indexed values in the second half. This saves us the creation of function 
arguments for the recursive calls. All we need to hand over to the lower levels is the position 


(oq, ai, 02 , 03 , 04 , O 5 , 06, 07 ) 



(oo) ( 04 ) ( 02 ) (oe) (oi) ( 05 ) ( 03 ) ( 07 ) 


Figure 5: Splitting an array into even and odd positions 
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and length in the array they should work on, since the values are already in the right order. 
Then the tree of function arguments looks like Figure 6 . 

We don’t even need extra storage for the return values! We can use the memory of the input 
parameters and overwrite it with the return values; the input parameters are no longer needed 
after the function has calculated the return values from them. 

If we put all this into code, our Python function looks like this: 


def bit_rev(x, b): # reverse h lower bits of x 

return sum (1<<(b-1-i) for i in range(0, b) if (x>>i) & 1) 


def shuffle(a): 
r = [] 

b = (len(a)-1).bit_length () 
pos = [bit_rev(n, b) 

for n in range(0, len(a))] 
for i in pos: 

r.append(a [ i]) 
return r 

def fft_eval(a, pos, n): 
half = n//2 
if half > 1: 

fft_eval(a, pos, half) 
fft_eval(a, pos+half, half) 
for k in range(0, half): 
w = root_of_unity(n , k) 
t = w ♦ a[pos+half+k] 
a[pos+half+k] = a[pos+k] - t 
a[pos+k] = a[pos+k] + t 
return 

def fft_inplace (a): 
aa = shuffle(a) 
fft_eval(aa, 0, len(aa)) 
return aa 


# shuffle input list a 

# empty list 

# bits used for indexing 

# list of new positions 

# cycle through list of positions 

# ... and build return list 

# work on a[pos..pos+n-1] 

# integer divide 

# even part 

# odd part 

# n-th root to k-th power 

# multiply only once 

# must use this order 

# ... to avoid overwriting 


# create reordered a 

# fft works in-place 


Let us analyze the number of arithmetic operations of the algorithm above. We have assumed 
that n is a power of 2. With each level the length of the input is halved until n = 1; this 
leads to logn levels of recursion. Furthermore, while the number n gets halved with each 
level, both halves are worked on, so all values are cycled over (see Figure 6 ). Since two return 
values are calculated with three arithmetic operations (two additions and one multiplication), 
the arithmetic cost per level is 3n/2, which leads to a total cost for the whole operation of 
T(n) = 3n/2 • logn. 


(oo, 04 , 02 , 06 , oi, 05 , 03 , 07 ) 



(oo) (04) ( 02 ) (oe) (oi) ( 05 ) ( 03 ) ( 07 ) 


Figure 6 : Halving the already shuffled array 





16 


Chapter 2. Overview of Established Algorithms 


We can prove the run-time more formally (inspired by [Sed92, pp. 77-78]). Obviously, T(l) = 0. 
Then the total arithmetic cost is 

T(n) = 2r(n/2) -|- 3n/2 

= 2(2T(n/4) 3n/4) -h 3n/2 

= 4r(n/4) -h3n/2-h3n/2 


= 2'°s^r(n/2'°g^) + 3n/2 • logn 
= nr(l) -I- 3n/2 • logn 

= 3n/2-logn. (2.4) 


2.7 FFT-based Polynomial Multiplication 


Now that we have introduced the fast Fourier transform and proved its run-time, let us see how 
we can use it to multiply two polynomials rapidly. 

Let ii be a commutative ring with unity and let n be a power of 2. Let w be a primitive n-th 
root of unity in R. Furthermore, let a{x) = Yi!j=o ^ o,jX^ and b{x) = Yi!j=Q ^ be polynomials 
over R. 

Please note that the polynomials a{x) and b{x) have a degree-bound of n/2. Since we are about 
to compute c(x) := a{x)b{x) we need to choose the number of sample points n as n > deg(c) = 
deg(a) -|- deg(6). To keep notation simple, we let Oj = bj = 0 for j E [n/2 : n — 1]. 

We evaluate both input polynomials at sample points A; E [0 : n — 1] to get sample values 
Ofc := and b^ '■= b{oj^). Then, we multiply the sample values pairwise to get the Ck ■= akhk- 

But how to retrieve the result polynomial c{x) from the c^? We will see how to accomplish that 
with ease if R, n and u meet two additional requirements: 

• — 1, A; E [1 : n — 1], must not be a zero divisor in R, and (2-5) 

• n must be a unit in R, meaning n is invertible. (2-6) 


We return to these requirements later. Assuming that they hold, we can prove that the same 
algorithm can be used on the Ck to regain the Cj that was used to compute the Ok and bk in the 
first place! That is to say: the Fourier transform is almost self-inverse, except for ordering of 
the coefficients and scaling. 

Let us see what happens if we use the Ok = a{u}^) = as coefficients of the polynomial 

a{x) := X]fc=o and evaluate a{x) at £ E [0 : n — 1], to compute an := We get 
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what is called an inverse transform: 


Ik 


CLi = a(a;^) 
n—1 

= ^ OkOJ' 
k=0 

n—1 n—1 

= E (E 

k=0 j=0 

n—1n—1 


= EE 


OnUJ 


{j+l)k 


j=0 k=0 
n—1 n—1 

= E%E(‘^‘"')‘ 

j=0 k=0 

Tl * mod ri' 

The last line holds due to the sum of the geometric series: 

n—1 

E(‘^T* = 


k=0 


i0^+^ - 1 
n—1 

Ei = 

A :=0 


n 


□ 


(mod n), 

(2.7) 

(mod n). 

(2.8) 


Now we see why (2.5) is required: if — 1, for A: G [1 : n — 1], is a zero divisor we are not allowed 
to do the division in (2.7). Furthermore, to remove the factor n in front of every «(_£) modn 
need (2.6). 

If we want to get the ai in the same order as the original Oj, we can simply evaluate a{x) at 
io~^ instead of ui^. This is called a backwards transform. 

To summarize: we can evaluate the Ck at points (jJ~^ to retrieve n • q, divide by n and have thus 
recovered the coefficients of our desired product polynomial. 


The overall arithmetic cost of this polynomial multiplication method is three FFTs in 0(n log n) 
plus n multiplications of pairs of sample values in 0{n) plus n normalizations in 0{n). The 
FFTs dominate the total cost, so it is 0(n logn). 


2.8 Modular FFT-based Multiplication 


We can put last section’s method into action and design a fast multiplication algorithm for long 
numbers using the quotient ring R = with prime p. This is sometimes called a number 

theoretic transform or NTT. According to [Knu97b, p. 306] this method goes back to Strassen 
in 1968. 

We want to multiply nonnegative integers a and b to get the product c := ab. We are free 
to choose an arbitrary p for our calculations, as long as last section’s requirements are met. 
Furthermore, our choice of p should be well suited for implementation. If we choose p to be 
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prime it means that the ring 'L/p'L is even a field, so we are sure to meet requirements (2.5) and 
(2.6). This field is denoted Fp. 

For the FFTs we need roots of unity of sufficiently high degree. Let p > 3. Since p is prime, we 
know that F* = {1,2, 3,... ,p — 1} is a cyclic multiplicative group of order p—1. We call <7 G F* 
a generator of F* if its powers g^, j G [0 : p — 2], create the whole group. Also, p is a primitive 
(p — l)-th root of unity. Note that = 1. 

F* contains p— 1 elements. Since p— 1 is even, we can find integers u,v>\, such that p— 1 = uv. 
Without loss of generality, let u be a power of 2. We know that g^~^ = 1, hence = {g'^Y = 1. 
Since v divides p — 1, we know is a n-th primitive root of unity in F*. 

Let us see how to reduce a long multiplication to polynomial mnltiplication: we have to distribute 
the bits of input numbers a and b to coefficients of polynomials a{x) and b{x). In Karatsuba’s 
algorithm we did cut the input numbers in two pieces of n words each, or wn bits, where w is 
the word size and kF = 2^^ is the wordbase. Accordingly, evaluating polynomial a{x) at 
yielded number a. Now we are going to cut the input numbers into pieces of r bits. But how 
to choose r? 

The larger r is, the less coefficients we get, that is, the lower the degree of the polynomial. In 
consequence, this can lead to smaller FFT lengths, which are faster to compnte. This is why 
we want to choose r as large as possible. 

If we multiply polynomials a(x) = ^ ^(®) = Sfc=o ^ product c(x) := 

a(x)b(x) = Z)”=o with cg = J2j+k=e(^Yk, observe that cg can contain up to n/2 summands. 
By constrnction aj, b^ < 2^, hence cg < — n2‘^^~^. But cg must also be less than p. 

Hence, our choice of p must make sure that 

p>n2^^-^. (2.9) 


For practical reasons, we want to choose a prime p that can be handled easily by the target 
machine’s processor, hence I chose p to be almost as big as the wordbase, so it can still be stored 
in one machine word. “Almost” means [logpj = tc — 1 to maximize the use of available bits per 
word. 

The above mentioned constrains led me to choose the following parameters:^ 


Word size (bits) 

Modulus p 

Composition of p 

Generator g 

8 

193 

3 • 2^^ + 1 

5 

16 

40 961 

5 • 2^3 1 

3 

32 

3489 660 929 

13 -228 + 1 

3 

64 

10 232178 353 385 766 913 

71 -2^^ + 1 

3 


T reproduce the numbers here, since it required some effort to calculate them. If one wants to do FFT with 
modular arithmetic, one must first find a suitable prime modulus p and a matching primitive n-th root of unity. 
So here they are. 
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From these numbers we can calculate the respective primitive n-th root of unity oj: 
Word size (bits) Order n of primitive root ca Primitive n-th root oj 


8 

2« 

5^ = 125 

16 

213 

3® = 243 

32 

228 

3^3 = 1594323 

64 

257 

3^^ = 3 419 711 604 162 223 203 


Now that we have chosen p, we can use (2.9) to calculate the maximum r for a given FFT 
length n: 


n22'-i < p 
log(n2^'’“^) < logp 
log n -|- 2r — 1 < logp 

2r < logp — log n -|- 1 

r < ~ logn -|- 1) 

Choosing r determines the degree of the polynomials and hence n, which in turn can have an 
influence on r. So, we might have to cycle several times over this formula to find the largest r 
and smallest n. 

Please note that this also imposes an upper bound on the length of input numbers this algorithm 
can handle: 


logn -|- 2r — 1 < logp 

log n < logp — 2r -|- 1 

logn < w — 2r (since [logpj = w — 1) 

log n < w — 2 (r has to be at least 1) 

n < 2“'"^ 
n < VF/4. 

This determines the maximum FFT length. In this case, r was 1 bit and hence the maximum 
output length is IF/4 bits or IF/32 bytes. Choosing a larger r only makes matters worse. The 
maximum FFT length might be even less than that, since the order of oj limits the FFT length 
as well. 

Now that we have chosen the necessary parameters, we can attend to the implementation. A 
Python version of the main routine looks pretty straightforward. I termed this function QMUL, 
alluding to QuickMul by Yap and Li [YLOO] . 


def qmul(a, b) : 

p, n, w, r = select_parain (a, b) 

al = split_input(a, p, r, n) 
bl = split_input(b, p, r, n) 

al = shuffle(al) 
bl = shuffle(bl) 

fft_eval(al, 0, n, w) 
fft_eval(bl, 0, n, w) 


# p: prime modulus, n: FFT length 

# w: n-th root, r: bits per coefficient 

# split inputs into n parts, ... 

# ... each maximum r bits long 

# shuffle inputs 

# evaluate inputs at roots of unity 
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cl = [] 

for i in range(0, n): 

cl.append(al [i] * bl[i]) 

cl = shuf fie (cl) 
fft_eval(cl, 0, n, w) 

inv = modinv(n, p) 
for i in range(0, n): 
cl[i] *= inv 

c = reasm (cl , r) 
return c 


# empty list 

# multiply pointwise 

# append to list 

# shuffle result 

# evaluate result 

# normalize result 


# reassemble result 


The functions fft_eval() and shuffle() have already been shown. Functions reasmO and 
split_input 0 are new: they cut up the input numbers and add up the coefficients of the 
resulting polynomial to a number, respectively. To find the proper number of bits per coefficient 
r and compute the FFT length n and matching root w function select_param() is used. 

The actual implementation I used for benchmarking was done in C++. To give an impression 
of the code, the following function is a simplified version of the evaluation. The actual code is 
more complicated, since I use C++ templates to unroll the last five levels of the FFT. This saves 
some call and loop overhead at the cost of code size. 


void qinul_evaluate (word* p, unsigned i, un 
unsigned n = 1 << Ig; 
unsigned half = n/2; 
if (half > 1) { 

qmul_evaluate(p , i, lg-1); 
qmul_evaluate(p, i+half, lg-1); 

} 

// w''0=l: no multiplication needed 
word t = p[i+half]; 
p[i+half] = modsub(p[i], t); 
p[i] = modadd(p[i], t); 

// handle w ""k, k>0 

word* pw = pre_w[lg]; // use precomputed roots 

for (unsigned k=l; k<half; ++k) { 

word t = modmul(pw [k] , p[i + half+k]); 
p[i+half+k] = modsub(p[i+k], t); 
p[i+k] = modadd(p[i+k], t); 

} 


signed Ig) { 

// number of values 
// half of them 

// even part 
// odd part 


Functions modadd () and modsub () are pretty straightforward and I don’t include them here, 
but modmul 0 is more complicated. It takes two 64-bit numbers as inputs and multiplies them 
modulo p. We recall that there is an intrinsic compiler function to do the multiplication, but 
the result has 128 bits and has to be reduced modulo p. To accomplish that, we could use a 
128-by-64-bit division, but it is quite slow and takes up to 75 cycles. 

Warren [War02, pp. 178-188] shows a solution how to replace a division by a constant with a 
multiplication and a shift. Since the input has 128 bits, the multiplication has to be 128-bit 
wide as well. But this only results in floor division. To get the rest from division we must do 
one more multiplication and one subtraction. In total, we have to do six 64-bit multiplications, 
plus some additions and shifts to do the one modular multiplication. Benchmarking shows that 
the whole modular multiplication can be done like that in about 10 cycles. 






Chapter 2. Overview of Established Algorithms 


21 



To save the time needed to calculate the roots of unity, I use arrays of precomputed roots 
pre_w[lg]. These powers are independent of the inputs and are reused every time QMUL is 
used. 

A graph of execution cycles of QMUL in comparison to T3MUL is presented in Figure 7. Please 
note that this graph covers a wider range of sizes than the previous graphs. We see that our 
new algorithm is asymptotically faster than the hitherto used T3MUL implementation. 

The stair-like shape of the graph is a result of the FFT: if input numbers get too large, the FFT 
depth must be increased by one level, thereby doubling the number of evaluation points. From 
these graphs we can say that QMUL starts to outperform T3MUL for inputs with a length of about 
110000 words or more, that is, about 2100 000 decimal digits. 


So we found an algorithm with a good asymptotic cost, but it only starts to pay off if inputs 
are quite long. Why is that so? What are the weak points of QMUL? 


• The modular multiplication and reductions are expensive. Six word-sized multiplications 
are not cheap. 

• The FFT length is large and hence many extra bits room for the sum of the coefficient 
products must be left free. Since the unit of operation is only a processor word, this “eats 
up” quite some percentage of its size. Plus, it implies a large FFT depth as well. 

• The maximum length for long numbers is limited to kF/32 bytes, even if larger numbers 
could be handled by the machine. 


The following celebrated algorithm will address all of the weak points listed above. 








22 


Chapter 2. Overview of Established Algorithms 


2.9 Modular Schonhage-Strassen Multiplication 


The idea of Schonhage-Strassen multiplication (of which my implementation is called SMUL) is 
to perform FFTs in rings of the form Z/(2^ + 1)Z, which are sometimes called Fermat rings. 
In Fermat rings, 2 is a primitive 2iF-th root of unity. This fact can be exploited to speed 
up multiplications by roots of unity during the FFTs: multiplications by powers of 2 can be 
implemented as shifts followed by a modular reduction and thus take only 0{K) time. This is 
the cornerstone of the efficiency of Schonhage and Strassen’s multiplication. 

Since all roots of unity are powers of 2, we don’t need to precompute them as in QMUL, but can 
just keep track of shift counts. Furthermore, modular reductions are simple and can be done 
with just another long number addition and/or subtraction. In this context, a shift followed by 
a modular reduction is called a eyelie shift. 

SMUL always computes the product modulo 2'^ -t-1, where N can be chosen. If we multiply input 
numbers a and b to get c := ab, we have to provide an N that is big enough to hold the full 
result. SMUL reduces a multiplication with N bits to smaller multiplications of iF ~ \/iV bits, in 
contrast to a reduction to word size as in QMUL.l If the size of pointwise multiplications exceeds 
a certain threshold, SMUL is used recursively, otherwise a simpler algorithm takes over. 

This algorithm was first published by Schonhage and Strassen in 1971 [SS71] and provided 
results modulo 2^ + 1, where N itself is a power of 2. A later version published by Schonhage 
[Sch82] relaxes the requirement to “suitable numbers” of the form N = z^2”, n G [n — 1 : 2n — 1]. 
For the implementation we can relax the requirement even more: Section 2.9.3 lists the details. 

We introduce some notation: to compute the product c of nonnegative numbers a and b, we do 
FFTs in the ring R := Z/(2^ + 1)Z. We use a Cooley-Tukey FFT and thus the FFT length 
n has to be a power of 2. Since we can choose R (and hence K) to suit our needs, we choose 
K = r2”*, with positive integers r and m. Our choice of K and m will in turn determine N, 
where N = s2”*, with positive integer s. This s is the number of input bits per coefficient. 

It it easy to see that 2 is a primitive 2iF-th root of unity: since 2^ -|- 1 = 0, we have 2^ = — 1 
and hence 2^^ = 1. Furthermore, it is obvious that for u S [1 : iF — 1] we get 2“ ^ ±1. For 
iF -L u =: u E [iF + 1 : 2iF - 1] we see that 2“ = 2^+^ = 2^2^ = -2'’ ^ 1. 

Because the FFT length is a power of 2, we need a primitive root of unity of the same order. 
Since 2 is a primitive root of unity of order 2K = 2r2”^, it holds that 1 = 2^^ = = (2^”)^"*. 

This makes oj := 2^” a primitive 2™'-th root of unity and the FFT length n := 2”*. We deliberately 
chose an even exponent for cu, since we will be needing y/u later. 

2.9.1 Invertibility of the Transform 

For the existence of the inverse FFT requirements (2.5) and (2.6) have to be met. Since 2^ +1 
may not be prime, we cannot rely on our argument from Section 2.8, so we must show that the 
requirements are met here, too: 

• With uj = 2^”, — 1, j E [1 : n— 1], must not be a zero divisor in Z/(2^ + 1)Z, and (2.10) 

reduction to word size is usually not possible in SMUL, because the FFT length is not sufficiently large to 
cut input numbers in parts so small, since there are not enough roots of unity. 
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• n = 2™ must be a unit in Z/(2'^ + 1)Z. 


( 2 . 11 ) 


To prove (2.10) we need some identities about the greatest eommon divisor (gcd): Let a, h 
and u be positive integers and (a, b) signify the greatest common divisor of a and b. Then the 
following identities hold: 


II 

(2.12) 

(a, 6) = (a — 6, 6), if a > b, 

(2.13) 

{ua, ub) = u{a, b), 

(2.14) 

{ua,b) = {u,b){a,b), if {u,a) = 1, 

(2.15) 

(ua, b) = (a, b), if u f 6, 

(2.16) 

,2'’ - 1) = 2(“’'’) - 1, 

(2.17) 

,2“ + l) = l. 

(2.18) 

2 ia, 2 b) _ 1 

’ ^ + 1) - 20,b) - 1 • 

(2.19) 


Identities (2.12) - (2.16) are well known, so we don’t prove them here. 

We prove (2.17) by induction on a + b. We assume without loss of generality that a > b. The 
induction basis is easily checked: (2^ — 1, 2^ — 1) = 1 = 2 ^^^) _ x. 

Now we show the induction step: we assume (2“ — 1, 2^ — 1) = 2^“’^^ — 1, for a + /3 < a + b. We 
use (2.13) and get 

(2“ - 1,2^ - 1) = (2“ - 1 - (2^’ - 1), 2^ - 1) 

= (2“-2^2^-l) 

= (2^(2“-ft_l),2'’-l) 

= ( 2 “"'’ - 1 , 2 ^’ - 1 ) 

_ 20-hb) _ ^ 

— 20h) _ 


(by (2.16)) 

(by IH) 

(by (2.13)) □ 


To prove (2.18) we use (2.13) and see that 

(2^ - 1,2^ + 1) = (2^ - 1, 2^ + 1 - (2^ - 1)) 

= (2'-l,2) 

= 1 . □ 

To prove (2.19) we use the well known difference of squares — 6^ = (a + b){a — b) and apply 
it to our case, where it yields 2^^ — 1 = (2^ + 1)(2^ — 1)- It holds that 

2 (a, 2 fe) _ 1 ^ (2“ - I, 2^'’ - I) (by (2.17)) 

= (2“-l,(2^ + l)(2^-l)) 

= ( 2 “ - 1 , 2 '’ + 1 )( 2 “ - 1 , 2 '’ - 1 ) 

= ( 2 “ - 1 , 2 '’ + 1 )( 2 (“’'’) - 1 ) 


(by (2.15) and (2.18)) 
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Divide by 2^“’^) — 1 and get 


2(a,2fe) _ 2 

20h) - 1 


(2“ - 1,2'' + 1). 


□ 


Recalling that ov = 2^'", n = 2’" and X = r2™' we can now prove (2.10) by showing that 
{u!^ — 1, 2^ + 1) = 1, for j E [1 : n — 1]. Thus all — 1 are units and therefore no zero divisors. 


- 1,2^ + 1) = {{2^^y - 1, 2''2’" + 1) 

= (22'’J-1,2'’2"*+1) 

2{2n,2r2^) _ 

2(2rj,'r2'^) _ 

22r(i,2’") _ I 
- 22r(i,2—1) _ I ■ 


(by (2.19)) 
(by (2.14)) 


Since j < 2™ it is clear that (j, 2”^) = (j, 2"* ^). Hence 


(u;^'-l,2^ + l) 


22r(j,2'—1) _ 
22r(j,2—1) _ X 


□ 


Still open is (2.11). For n = 2^” to be a unit in 'Ll{2^ + 1)Z, there must exist an i with 
2™'i = 1 = 2^^. Obviously, i = works. □ 


2.9.2 Convolutions 

Schonhage-Strassen multiplication always computes results modulo 2^ + 1. If it is used recur¬ 
sively to compute the pointwise products this comes in handy, since it allows multiplications 
where the results are in [0 : 2^] without performing a modular reduction. This lowers the FFT 
length and thus the execution time by a factor of two. We will now see how to use convolutions 
to accomplish this. 

If a{x) = YUiZo o.ix'' and b{x) = J^j'ZobjX^ are two polynomials with coefficients a*, bj E R, 
then the coefficients of their product c{x) := a{x)b{x) = are given by the (acyelic) 

eonvolution formula 

Ck= (2.20) 

i-\-j=k 

Figure 8 shows the product of two polynomials a and b both with degree three. The lines 
from top-left to bottom-right are convolutions with the dots being products of the individual 
coefficients. For each convolution the sum of the indices of the coefficient products is constant. 
As Gilbert Strang put it: “You smell a convolution when [the indices] add to [k]” [StrOl]. 

In the process of the FFT as laid out in Section 2.7, two input polynomials are evaluated at 
w*, i E [0 : n — 1], where w is a primitive n-th root of unity. Afterwards, the sample values are 
multiplied pointwise and transformed backwards to get the product polynomial. 
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Figure 8: Convolution of two polynomials 


Define the mapping 


: R[x] R^, 

a{x) (a(w°, • • •, a{oj'^~^)). 

The kernel of cj) is the ideal generated by Y\!iZo{x — w®). Since w” = 1, surely (a;®)”' = 1 holds 
as well, for f G [0 : n — 1]. So the polynomial x®® — 1 yields zero for each x = w®, hence it has n 
distinct roots and the n linear factors x — ca®. From that we conclude that 
and hence that the kernel of cj) is the ideal generated by x®® — 1. 

This means that polynomial multiplication that uses the mapping cj) always gives results modulo 
X®® — 1. This is called the eyclie convolution of two polynomials. Given the aforementioned 
polynomials a(x) and h{x) it produces the product polynomial c(x) with coefficients 

Cfc = 

i-\-j=k 
(mod n) 


Figure 9 shows the cyclic convolution of two polynomials of degree three. Here, the upper half 
of coefficients “wraps around” and is added to the lower half. This is why it is sometimes called 
a wrapped convolution. 

We now know that a cyclic convolution gives us results modulo x®® — 1. Can we get results 
modulo X®® + 1? Schonhage shows us we can. 

Since i, j, k < n we can write (2.21) as 

c/c — 'y ( aibj T y ( aibj. (2.22) 

i-\-j=k i-\-j=n-\-k 

The second sum contains the higher half product coefficients that wrap around and are added 
to the lower half coefficients, since x®® = 1. But if we want results modulo x®® + 1, it holds that 
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Figure 9: Cyclic convolution of two polynomials 


= —1, hence what we are looking for is a way to compute 

Cfc — ^ ' o,ibj ^ ^ Oibj. 

i-\-j=k i-\-j=n-\-k 


(2.23) 


Schonhage’s idea is to weight each of the coefficients o* and hj prior to the cyclic convolution 
in such a way that for i + j = n + k and k < n it holds that 9'^aibj = —aibj, for some 9 & R 
that we will specify immediately. This puts the desired minus sign in front of the second term 
in (2.23). 

Choose the weight 9 as follows: let 0 be a primitive n-th root of —1, that is, 0"' = —1 and hence 
9^ = u. To compute (2.23), we use (2.21), but weight the inputs like 


Oj := and bj := 9^bj 


and apply the proper “counterweight” 9 ^ to the whole sum, so we get 


Ck = 9 ^ 




i-\-j=k 
(mod n) 


— 


Y + Y 

2 +ji'=/c i-\-j=n-\-k 

= 9-^ Y. &"ai9^bj + 9-’^ Y 0"ai9^bj 


i-\-j=k 


i-\-j=n-\-k 


— Q-k 


Y + 9-^ Y ^ 

i-\-j=k i+j=n-\-k 

= Y, ^ibj + Y 

i^j=k i+j=n-\-k 

= Y^ “ Y^ aibj. 


\n-\-k 


Ciibj 


i-\-j=k 


i-\-j=n-\-k 


(2.24) 

(2.25) 


□ 
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Figure 10: Negacyclic convolution of two polynomials 


This is called a negacyelic or negative wrapped convolution. Figure 10 shows a diagram of it. 
Please note that 6 is in Z/(2^ + 1)Z as well a power of 2, so weighting can be done by a cyclic 
shift. 

According to (2.23), the Ck can become negative. Yet, we are looking for nonnegative c). = Ck 
(mod 2^ + 1) with c). € [0 : 2^]. If Ck < 0, we can find c'j. '.= Ck + 2^ + 1. 

2.9.3 The Procedure 

We are now all set to describe the whole procedure: given nonnegative integers a and b find 
their product c := ab modulo 2-^ + 1. 

Since the product is computed modulo 2-^ + 1, we must choose N big enough for the full product 
c. If we choose N > [logo] + [log6] this is surely the case. 

Denote R the ring Z/(2'^ + 1)Z, for some K = r2™'. Let n := 2"^ be the FFT length and 
let s := [A^/n] be the bit length of input coefficients cut from a and b. Then our choice of 
parameters has to meet the following constraints: 


• R must contain a primitive n-th root of unity to that is an even power of 2. (2^^)*^ = 1 = 
2^^ leads to the sufficient condition 

n I K. (2.26) 

• R must be big enough to hold the convolution sums. Because of (2.23) , the Ck G [—n2^*+l : 
n2^® — 1], so the total range has size 2n2^^ — 1. Hence select K so that 2^ +1 > 2n2‘^^ — 1 = 
2 m+ 2 s+i _ ^ jg sufficient to select 


K > m + 2s + 1. 


(2.27) 


For increased speed, we might want to choose a larger K that contains a higher power 
of 2. We will perform benchmarking later to find out if it pays off. 
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These constraints lead to values for the FFT length n := 2"^, the number of input bits per 
coefficient s := [Ai/n], and K = r2™’ > m + 2s + 1. This in turn forces a new, maybe slightly 
higher value for N := s2'^, and determines oj := 2^'' and 9 := 2^. Given those parameters, we 
can proceed like we did with QMUL in Section 2.8, but with some alterations: 

1. Split both input numbers a and b into n coefficients of s bits each. Use at least K + 1 bits 
to store them, to allow encoding of the value 2^. 

2. Weight both coefficient vectors according to (2.24) with powers of 6 by performing cyclic 
shifts on them. 

3. Shuffle the coefficients Oi and bj. 

4. Evaluate a* and bj. Multiplications by powers of u are cyclic shifts. 

5. Do n pointwise multiplications Ck '■= a^bk in Z/(2^ + 1)Z. If SMUL is used recursively, 
provide K as parameter. Otherwise, use some other multiplication function like T3MUL 
and reduce modulo 2^ + 1 afterwards. 

6. Shuffle the product coefficients Ck- 

7. Evaluate the product coefficients c^. 

8. Apply the counterweights to the according to (2.25). Since 0^" = 1 it follows that 

0—k — Q2n—k 

9. Normalize the Ck with lln = 2“™ (again a cyclic shift). 

10. Add up the and propagate the carries. Make sure to properly handle negative coeffi¬ 
cients. 

11. Do a reduction modulo 2^ + 1. 

If SMUL is used recursively, its input parameter N cannot be chosen freely. The calling SMUL 
provides its parameter K as the input parameter N of the called SMUL. 


I implemented some optimizations to the procedure outlined above to save execution time: 

• Steps 1, 2 and 3 can be combined. Furthermore, since some high part of a and b is virtually 
zero-padded, initialization of that part can be done quickly. 

• Steps 8 and 9 can be combined. 

• On the outermost SMUL, where N can be chosen, we don’t need to do a negacyclic trans¬ 
form. This lets us skip the weighting of o^, bj and Ck in Steps 2 and 8. We don’t check 
for negative coefficients in Step 10 and don’t need the reduction in Step 11. Furthermore, 
we don’t need 6 = y/ui and thus can extend the FFT length by another factor of 2. The 
sufficient condition for selecting K relaxes to n \ 2K. 

• The cyclic shifts often shift by multiples of the word size w, where a word-sized copy is 
faster than access to individual bits. 
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2.9.4 Run-time Analysis 

Let us analyze the cost of SMUL to compute a product modulo 2^^ + 1 and call this cost T{N). 

According to (2.26), it is clear that K > 2”^, but we will show the time bound for the condition 

K = 2™. (2.28) 

This means that we might have to choose a K that is larger than required by (2.26) and (2.27), 
but our choice only increases K by at most a factor of 2. 

Furthermore, according to (2.27), A > m + 2s + 1, where s = [A/2”^]. Surely we will find a 
suitable K = 2"^ < 2(m + 2s + 1). So for sufficiently large values of N 

m + 2iV/2™ + 1 < A < 2m + 4A/2™ + 2 
2N/K <K < 5N/K 

2iV < < 5N. (2.29) 

Vm <K< (2.30) 


Steps 1, 3, 6 and 10 have obviously cost 0{2^K). The same applies to Steps 2, 8 and 9, since 
the cost of cyclic shifts modulo 2^ + 1 is 0{K) as well. By the same argument Step 11 has cost 
0{N). 

According to (2.4), the FFT evaluation costs O(nlogn), with n = 2”^, but we have to take into 
account that in contrast to (2.4), multiplications by roots of unity don’t cost 0(1) here, but 
0(A), so the cost of evaluation in Steps 4 and 7 is 0(m2”^)0(A). That leaves Step 5, where 
we have 2™ multiplications modulo 2^ + 1, so the cost of that is 2^T{K). 

If we add everything up, we get for the total cost 

T{N) = 0(2™ A) + 0{N) + 0(m2™)0(A) + 2™r(A). 

Using (2.28) and (2.29) we get 

T{N) = 0{N) + 0{mN) + KT{K) 

= 0{mN) + KT{K). 

By (2.30) we know that 2™ = A < \/5A, hence m ^ log (5A). Ergo 

T{N) = O(AlogA) +0(\/A)r(V^). 

Unrolling the recursion once leads to 

T{N) = 0{N log N)+0 {Vn){ 0{V^log V^) + 0{\fm)T{\f¥N)) 

= O{N log N) + 0{^/N){0{^/N log N) + 0{<fN)T{^f¥N)) 

= 0{NlogN) + 0(Alog A) + 0{V^)T{yf¥N). 

' -V-" 

=:A 
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After log log steps the remaining summand A < 0{N): 

T{N) = 0{N log N)+0{NlogN) + ... + 0{N) 

' -V-" 

log log N times 

= 0{N log N)+0{N log N) log log N + 0{N) 

= 0{N ■ log N- log log N). (2.31) 

To see why it takes log log N steps, observe that the order of the root doubles with each recursion 
step. Hence, after log log steps the order has reached = log A =; A. So the remaining 

summand A < 0{Vn^)T{V5^^) < 0{N)T{5VN). Lastly, VN = ivVi°g^ = 2 and 
hence A < O(A)r(10) < 0{N). 

Until the discovery of Fiirer’s algorithm [FiiOT] in 2007 this was the lowest known time bound 
for a multiplication algorithm. 


Now let us look at memory requirements. Memory needed for all 2"^ coefficients of one input 
number is 2™AT bits. According to (2.27) with s = |'A^/2™'] it holds that K > m + 2|'A^/2™'] +1. 
Hence memory requirements in bits for one polynomial are 

2mK >2^.{m + 2[iV/2™] + 1) 

> 2”^ • (m + 2iV/2™ + 1) 

> 2"* • 2iV/2™ 

> 2N. 

Temporary memory is required for both input polynomials, but for the resulting polynomial 
storage of one of the input polynomials can be reused. SMUL needs some memory for the 
multiplication of sample points, but this is only of the size of one coefficient, that is, K bits 
and doesn’t change the order of the approximation. Hence, if N denotes the bit length of the 
product and Msmul(A^) denotes total memory required by SMUL, it holds that 

AfsMUL(A^) ~ 4iV bits. (2.32) 

Figure 20 shows measured memory requirements, but note that in that table N refers to the 
bit length of one input, where in this section N denotes the bit length of the produet. 


2.9.5 Benchmarking 

Apart from optimizing the implementation on higher (see page 28) and lower levels (like assembly 
language subroutines) benchmarking shows that we can save quite some execution time by 
finding the fastest FFT length n from all possible values. 

For this, we measure execution cycles for multiplications with different possible FFT lengths. 
In principle, larger FFT lengths lead to faster multiplications, but the largest possible FFT 
length is usually not the fastest. Larger FFT lengths lead to smaller coefficient sizes, but 
more operations on the coefficients. On the other hand, the value of the primitive n-th root 
Lv might allow byte- or even word aligned (or even better SSE-word aligned) shifts, which can 
be implemented faster than general bit-shifts. The smaller the FFT length, the better the 
alignment for the cyclic shifts. 
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Figure 11; SMUL FFT length vs. input length 


Maybe even more importantly, certain values of K that contain high powers of 2 allow for larger 
FFT lengths in the recursively called SMUL. So sometimes larger K work much faster, even if 
the FFT length stays unchanged. 

As a result, the fastest FFT length switches several times until it settles for a higher value. 
Figure 11 gives an impression of this. The '/N graph is printed for orientation, since 2™ > 

The graph of execution cycles vs. input lengths is shown in Figure 12. We can see that it is well 
below the QMUL graph, but intersects with the T3MUL graph at about 2500 words, that is, about 
47 500 decimal digits. Furthermore, we observe a certain “bumpiness” of the graph, which is 
a result of the changing FFT lengths and ring sizes. Yet, it is much smoother than the QMUL 
graph. 

Lastly, we try to model the run-time according to its theoretical value (2.31) for large values of 
N. If we write the run-time with an explicit constant, then 

Ta{N) < a ■ N ■ logA^ • log log W (2.33) 

Dividing measured execution cycles by N ■ log N ■ log log N to calculate a leads to the graph 
depicted in Figure 13. Please note that here N is the length of the product in bits. Interestingly, 
this graph seems to have two plateau-like sections. 

The first plateau ranges roughly from 12 800 to 8 000 000 input bits and the second plateau 
starts at about 32 000 000 input bits. Since SMUL requires about 4Ai bits of temporary memory, 
the above numbers indicate a plateau from 12 KB to 8 MB and another starting from 32 MB 
temporary memory. This corresponds quite nicely with the cache sizes of the test machine (see 
Appendix A). Such an influence on the run-time constant cr is no longer visible only after the 
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Figure 12; Execution time of SMUL 


required temporary memory is some orders of magnitude larger than the cache size. In our case 
that would be starting from about 32 MB temporary memory. 

Since after level 3 there are no other caching mechanisms, we can average a for input sizes 
above 32 Mbits (since a word is 64 bits, this equals 512 Kwords) and that leads to an average 
0- 0.3159. 
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Figure 13: SMUL run-time constant tr 


2.9.6 Outlook 

There are some possible improvements which might lower execution time that I have so far not 
implemented or tested. Namely, this is: 


• I benchmark different FFT lengths n to find the fastest one, but I could benchmark 
different ring sizes K as well. It is sometimes profitable to use larger values for K, if K 
contains higher powers of 2. 

• Schonhage’s \f2 knackf [SGV94, p. 36, exercise 18]. We can increase the transform length 

by a factor of 2 by noticing that _ 2^/^ is a primitive 4iF-th root of unity, since 

= 2 (mod 2^-|-1) and 2^^ = 1 (mod 2^ -|-1). In [GKZ07], the authors mention a 10 % 
speed increase. That paper also contains some other promising fields of optimization. 

• The current implementation cuts input numbers into coefficients only at word boundaries. 
Maybe cutting them at bit boundaries might lower K slightly for some input lengths. 

• The procedure outlined in [GP05, pp. 502-503] saves one bit in K by selecting iF > m -|- 2s 
and then uses a sharper bound for the than I did by noticing that each can only 
have k + \ positively added summands, see Figure 10. 


^[GKZ07] call it \/2 trick, but Schonhage interjects that “trick” sounds too much like a swindle, so I call it a 
knack instead. 




















































Chapter 3 


The DKSS Algorithm 


This chapter describes DKSS multiplication, especially how it employs the fast Fourier trans¬ 
form, and analyzes its execution time theoretically. Finally, differences between my implemen¬ 
tation and the DKSS paper are described. 


3.1 Overview 


Schonhage and Strassen’s algorithm for fast multiplication of large numbers (implemented as 
SMUL, see Section 2.9) uses the ring R = Z/(2^ -|- 1)Z and exploits the fact that 2 is a primitive 
2K-th root of unity in R. This permits the crucial speed-up in the fast Fourier transform: 
multiplications by powers of the root of unity can be realized as cyclic shifts and are thus 
considerably cheaper. An A-bit number is broken down into numbers that are 0{y/N) bits long 
and when sample values are multiplied, the same algorithm is used recursively. 

The DKSS algorithm (its implementation is called DKSS_MUL here) keeps this structure, but 
extends it further. Where SMUL used the ring Z/(2^ -|- 1)Z with 2 as root of unity, DKSS 
multiplication uses the polynomial quotient ring TZ := V[a]/{a'^ + !)• Since a™' = —1, a is a 
primitive 2m-th root of unity and again multiplications by powers of the root of unity can be 
done as cyclic shifts. Underlying TZ is the ring T’ ■= Z/p^Z, where p is a prime number and 2 : is 
a constant. This “double structure” can be exploited in the FFT and allows to break down an 
A-bit input number into numbers of 0(log^ A) bits. 

In their paper [DKSS13], De, Kurur, Saha and Saptharishi describe the algorithm without any 
assumption about the underlying hardware. Since we are interested in an actual implementation, 
we can allow ourselves some simplifications, namely the precomputation of the prime p, and as 
a consequence drop their concept of /c-variate polynomials by setting k = 1. Section 3.4 explains 
the differences between my implementation and the original paper in more detail. 


3.2 Formal Description 


We want to multiply two nonnegative integers a, 6 < 2^, A E N to obtain their product 
c ■= ab < 2^^. As usual, we convert the numbers into polynomials over a ring (denoted TZ), 
use the fast Fourier transform to transform their coefficients, then multiply the sample values 
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and transform backwards to gain the product polynomial. From there, we can easily recover 
the resulting integer product. 

Denote TZ := V[a\/{a^ + !)• As usual, we identify TZ with the set of all polynomials in V[q\ 
which are of degree less than m and where polynomial multiplication is done modulo (a”* + !)• 
Polynomial coefficients are in V and are called inner coefficients. Furthermore, define V ■= 
'Lj'p^T,^ where p is a prime number and z is a constant chosen independently of the input. We 
will see how to choose p shortly. 

Input numbers a and b are encoded as polynomials a{x) and b{x) £ TZ[x] with degree-bound M. 
That is, a(x) and b{x) are polynomials over TZ whose coefficients are themselves polynomials 
over V. Call the coefficients of a{x) and b{x) outer coefficients. 

This outline shows how to multiply a and b. The following Sections 3.2.1 - 3.2.8 contain the 
details. 

1. Choose integers m > 2 and M > m as powers of 2, such that m ~ logA^ and M ~ 
N/\o^ N. We will later perform FFTs with length 2M, while m is the degree-bound of 
elements of IZ. For simplicity of notation, define p := 2M/2m. 

2. Let u := [2A^/Mm] denote the number of input bits per inner coefficient. Find a prime p 
with 2M \ p — 1 and p^ > Mm2^“. The prime power p^ is the modulus of the elements of 

V. 

3. From parameters M, m and p compute a principal (see Section 3.2.2 for definition) 2M-th 

root of unity^ p £ TZ with the additional property that p 2 M/ 2 m _ property plays 

an important part in Step 5. 

4. Encode a and b as polynomials a{x), b{x) £ TZ\x\ with degree-bound M. To accomplish 
that, break them into M blocks with rim/2 bits in each block. Each such block describes 
an outer coefficient. Furthermore, split those blocks into m/2 blocks of u bits each, where 
each block forms an inner coefficient in the lower-degree half of a polynomial. Set the 
upper m/2 inner coefficients to zero. Finally, set the upper M outer coefficients to zero 
to stretch a{x) and b{x) to degree-bound 2M. 

5. Use root p to perform a length-2M fast Fourier transform of a{x) and b{x) to gain a* := 
a(/?*) £ TZ, likewise hi. Use the special structure of TZ to speed up the FFT. 

6. Multiply components Ci := Oibi. Note that a*, hi £ TZ are themselves polynomials. Their 
multiplication is reduced to integer multiplication and the DKSS algorithm is used recur¬ 
sively. 

7. Perform a backwards transform of length 2M to gain the product polynomial c(x) := 
a{x)b{x). 

8. Evaluate the inner polynomials of the product polynomial c(x) at a = 2“ and the outer 

polynomials at x = to recover the integer result c = ab. 


3.2.1 Choosing M and m 

Choose m > 2 and M > m as powers of 2, such that m ~ logN and M ~ N/log^ N. For the 
run-time analysis, the bounds M = 0{N/ log^ N) and m = O(logN) are more convenient. 

^We simply write p instead of p(a), keeping in mind that p itself is a polynomial in a. 
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3.2.2 Finding the Prime p 

We use the following definition that captures the requirements for the existence of the inverse 
FFT transform (cf. Sections 2.7 and 2.9.1): 

Definition. Let R be a commutative ring with unity. A primitive n-th root of unity f ^ R 
is called principal if and only if X)r=^o^(C'^)* ~ j S [1 : n — 1], and n is coprime to the 

characteristic of R. 

Since numbers are encoded as polynomials with degree-bound M (Step 4) and then multiplied, 
the result has a degree-bound of 2M, so we need a principal 2M-th root of unity for the FFTs. 
li p '■= h- 2M -|- 1 is prime for some h E N (primes of this form are called Proth primes) , we can 
compute a principal 2M-th root of unity uj in Z/p^Z. Section 3.2.3 shows how it is done. 

Why is > Mm2^“ required? Since both a{x) and b{x) have degree-bound M, each outer 
coefficient of their product c(x) is the sum of up to M outer coefficient products. Each of 
these products is the sum of up to m/2 inner coefficient products, with each factor < 2“ by 
construction. So the inner coefficients can take values as high as ^Mm(2“ — 1)^. If we choose 
p^ > Mm2^“, we are on the safe side. 

But does a prime of the form p = h ■ 2M 1 exist for all Ml We can answer that in the 
affirmative with the help of the following 

Theorem (Linnik [Lin44a], [Lin44b]). For any pair of coprime positive integers d and n, the 
least prime p with p = d (mod n) is less than £n^, where £ and L are positive constants.^ 

We want to show the existence of a prime p with p = 1 (mod 2M), but also require p^ > Mm2^“. 
Since Linnik’s Theorem makes only a statement about the first prime, we must check that 
this prime to a constant power matches the requirement. An easy calculation shows that 
(2M -|- 1)® > Mm2^“. As p is of the form p = h - 2M + 1, we see that for every /i E N and every 
z > 6 this means that p^ > Mm2^“. With the size condition resolved, we use Linnik’s theorem 
to show that p < i{2M)^. □ 

To get an estimate of p in terms of N, we recall that M = 0{N/\o^ N) and see that 

In the implementation I tested candidate primes p for primality by using the Lucas-test [CP05, 
sec. 4.1] that allows for fast deterministic primality testing if the full factorization of p — 1 is 
known, p — 1 is a power of 2 times a small factor, because p = h ■ 2M -|- 1, so this test is well 
suited here. 

With that in mind, we can precompute values for p for all possible lengths N, since our supported 
hardware is 64-bit and hence N < 2®^ and (assuming M « N/ log^ N) M < 2^^. 

3.2.3 Computing the Root of Unity p 

In Step 2 we computed a prime p = h ■ 2M -|- 1, /i E N. Now we want to find a 2M-th root of 
unity oj in Z/p^Z. A generator of F* = {1, 2,... ,p — 1} has order p — 1 = h ■ 2M. Hence f 

^Over the years, progress has been made in determining the size of Linnik’s constant L. A recent work by 
Xylouris [Xylll] shows that L < 5. 
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is a primitive {p — l)-th root of unity and a primitive 2M-th root of unity in Z/pZ. In fact, 
both and C,^ are even principal. The following theorem allows us to find roots in 'L/p^'L for 
integer values s > 2 : 

Theorem (Hensel Lifting [NZM91, sec. 2.6]). Let f E Z[x] and let € Z 6 e a solution to 
f{x) = 0 (mod p^), such that f'{Cs) Is a unit in Ijlplj. Then Cs+i := Cs “ f iCs)/f'(Cs) solves 
f{x) = 0 (mod and furthermore = Cs (mod p®). 

Finding a primitive (p — l)-th root of unity in Z/p^Z means solving f{x) = x^~^ — 1. We 
can use Hensel lifting, because f'{Cs) = {p — is a unit in Z/pZ, since p — 1 / 0 and 

^ 0 (mod p). If we start with x = (" as solution to /(x) = 0 (mod p), then repeated 
lifting yields a (p — l)-th root of unity m Z/p^Z. Hence oj := Cz is a 2M-th root of unity in 
Z/p^Z. To see that oj is also primitive, let j E [1 ; 2M — 1]. Then ^ 1 (mod p), 

as C is a primitive (p — l)-th root of unity in Z/pZ. 

To prove that oj is even principal note that the characteristic of 7?. is p^, so w has to be coprime 
to p^, that is, coprime to p. But w = Cz = ^ 0 (mod p), so w is not a multiple of p. Hence 

uj and p^ are coprime. Furthermore, it holds for j E [1 : 2M — 1] that 


2M-1 


E 


1 - 00^“^^ 
l-UJ^ 


1 - {uj^^y 

1 — UJ^ 


0 (mod p^). 


because cj is a primitive 2M-th root of unity in Z/p^Z. 


□ 


We are looking for a principal 2M-th root of unity p E 7^ with the additional property p2M/2m _ 
a. Since TZ = V[a\/{a'^ + 1); « is a principal 2m-th root of unity. Denote 7 := 
a principal 2 m-th root of unity in V. Observe that 7* is a root of a™ + 1 = 0, for an odd 
i E [1 : 2m — 1], since (7*)™ = = (—1)* = —1. Because the 7* are pairwise different it 

follows that 

2m—1 

n (« - y) = a™ + 1. 

i=l 

i odd 


Theorem (Chinese Remainder Theorem [Fisll, sec. 2.11]). If R is a commutative ring with 
unity and R, ..., R are ideals of R, which are pairwise coprime (that is, R + Ij = R, for 
i j), then the mapping 


(j) : R^ R/Ii X ... X R/Ik, 

X (x + /i,... ,x + 4) 

is surjective and ker cp = R ■ ... ■ R- Especially, (p{x) = (^(x') x — x' £ R ■ ... ■ R and 


R/{R ■ ... ■ R) = R/R X ... X R/R. 


If the ideals R are generated by (0 — 7 *), they are pairwise coprime, since 7 * — 7 -? is a unit in 
IZ, for i 7 ^ j, see (3.3) below. So a E 7^[a]/(a”^ + 1) is isomorphic to the A:-tuple of remainders 
( 7 , 7 ^,... , 7 ^™“^) E YiiyPlcy/Ii). We are looking for a p satisfying = a, but we already 

know that = 7 , hence (a;,a;^,... ,uj‘^'^~^) is the tuple of remainders isomorphic to p. To 

regain p E 7^[a]/(Q:”^ + 1) we use the next 

Theorem (Lagrange Interpolation). Let R be a commutative ring with unity. Given a set of 
k data points {(xi,pi), ..., {xk,yk)} with {xi,yi) E Rx R, where the Xj are pairwise different 
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and Xi — Xj is a unit for all i / j, there exists a polynomial L{x) of degree less than k passing 
through all k points {xi,yi). This polynomial is given by 


k 

Hx) :=J2yMx), 

i=l 


where 


k 

:= 


i=i 

iAi 


X — Xj 
Xi — Xj 


In our case we know that p{a) = ^), 

{( 7 , 0 ;), and hence 

2m—1 

p(a) := ^ uj^£i(a), where 

i=l 
i odd 


so it follows that the set of data points is 


2m—1 j 

:= TT ( 3 . 2 ) 

/ii r-7^ 
iAi 

j odd 


The inverses to 7 * — 7 -^ exist. To see why, observe that an element of Z/p^Z is a unit if and only 
if it is not divisible by p. But 

y _ y = y(p-l)/2m _ y(p-l)/2m 

_ y(p-l)/2m _ y(p-l)/2m (3 3) 

^ 0 (mod p), 


because C is a primitive (p — l)-th root of unity and i, j G [1 : 2m — 1] and since i j the two 
exponents of C are different. □ 


3.2.4 Distribution of Input Bits 

We want to encode a nonnegative integer a < 2^ as polynomial over TZ[x\ with degree-bound 
M. We already calculated u = [2A^/Mm], the number of bits per inner coefficient. First, a is 
split into M blocks of um/2 bits each, starting at the lowest bit position. Each of these blocks 
encodes one outer coefficient. Since Mum/2 > N, we might need to zero-pad a at the top. 

Then, each of the M outer coefficient blocks is broken into m/2 blocks, each u bits wide. They 
form the inner coefficients. Since the inner coefficients describe a polynomial with degree-bound 
m, the upper half of the coefficients is set to zero. 

Finally, set the upper M outer coefficients to zero to stretch a(x) to degree-bound 2M. Figure 14 
depicts this process. 


3.2.5 Performing the FFT 

Section 2.6 described a radix-2 Cooley-Tukey FFT. The DKSS algorithm uses an FFT with a 
higher radix, but still the same basic concept. A Cooley-Tukey FFT works for any length that 
is a power of 2, here the length is 2M and it can be split as 2M = 2m ■ p, with p = 2M/2m. 

The DKSS algorithms uses a radix-// decimation in time Cooley-Tukey FFT (cf. [DV90, sec. 4.1]), 
that is, it first does p FFTs of length 2m, then multiplies the results by “twiddle factors” and 
finally performs 2m FFTs of length p. We can exploit the fact that the length-2m FFT uses 
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Y • u bits 


Figure 14: Encoding an input integer as a polynomial over TZ 


a as root of unity, since multiplications with powers of a can be performed as cyclic shifts and 
are thus cheap. 

We now describe the process formally. By construction, a{x) S TZ[x] is a polynomial with degree- 
bound 2M and p €lZ is a principal 2M-th root of unity. Bear in mind that p 2 M/ 2 m — pfj, — 
Since = —1, a is a primitive 2m-th root of unity in IZ. We can compute the length-2M 
DFT of a{x) with p as root of unity in three steps: 


i. Perform inner DFTs.^ 

Figure 15 shows the input vector a, which contains the coefficients of the polynomial a(x). 
The arrow indicates the ordering of the elements for the DFT. 

2M = 2m ■ p elements 

Oo Oi ... a^-i O-fM+l ■■■ a(2m-l)/i-l fl(2m-l)/j fl(2m-l)/j+l ■■■ a2m/j-l 

Figure 15: Input vector a 

Rewrite the input vector a as 2m rows of p columns and perform FFTs on the columns, 
see Figure 16. The boxes hold the values of vectors called e^, while the arrows indicate 
the ordering of their elements. 


p columns 


f -\ 

O/i-l 
«2/x-I 

Q-2m/i—1 
_./ 

= eo = ei = e^_i 



> 2m rows 


Figure 16: Input vector a written as p column vectors of 2m elements 


Wlease note that the inner and outer DFTs have no relation to the inner or outer coejficients. 
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We now define polynomials at,(x), which are residues of modular division. We will show 
that they can be calculated by performing DFTs on the e^. 

Let u € [0 : 2m — 1] and define polynomials a.u{x) € Tl[x] with degree-bound /i as 

a„(x) := a(x) mod — a^). (3.4) 

Denote aj E TZ the j-th coefficient of a(x), let E [0 ; ^ — 1] and define e£{y) E 7^[y] as 

2 m—1 

:= ■ y^- 

j=0 

That is, the j-th coefficient of e£{y) is the {jfi -|-^)-th coefficient of a(x), and ei{y) is a 
polynomial over TZ with degree-bound 2m. 

To calculate a„(x), write it out: 

a„(x) = a{x) mod {x^ — a^) 

= (ao + aix -I- ... -I- 

afj,xf" + ... -h a2|^-lx‘^^~^ + 

a2£iX^^ + a2^+ix^^+^ -h ... V 

... -I- a 2 M-ix^^~^) mod (x^ — a^). 

Since x^ = (mod x^ — a^), replace x^ with and get 

a^(x) = ao -I- aix a^_ix^“^-|- 

-I- a^+ia^'x -I- ... -|- a2^-ia^x^“^-l- 
a2fia‘^'’ + a2fj,+ia‘^''x -h ... -b a3^_ia^’'x'"“^-b 
• • • + a2M-ia ' x^ . 

Denote the .^-th coefficient of a^(x). Adding up coefficients of matching powers of x 
yields 

2 m—1 

^ ^ O.'jfl+i * • 

j=0 

Compare this to (3.5) to see that 

av,e = 6^(0^). 

So to find the £-th coefficient of each ay{x) we can perform a length-2m DFT of ei{y), 
using a as root of unity. Call these the inner DFTs. If we let i run through its y possible 
values, we get the coefficients of all a„(x). Figure 17 shows the result of the inner DFTs. 
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Oo 

hi 





0 -0,0 

Oo,l 

oo,u-i 


Ol,0 

ai,i 

Ol,/i-l 


02m-l 


Q2m—1,0 0‘2m—l,l • • • Q2m—1,^—1 ] 


Figure 17: Result of inner DFTs as 2m row vectors of g elements 


Multiplications by powers of a can be performed as cyclic shifts. Since = —1, coeffi¬ 
cients of powers > m wrap around with changed sign. This works much in the same way 
as the integer 2 in Schonhage and Strassen’s multiplication algorithm in Section 2.9. 

ii. Perform bad multiplications. 

What De, Kurnr, Saha and Saptharishi call bad multiplications is known as multiplications 
by twiddle factors in the Cooley-Tukey FFT. 

Our goal is to compute the DFT of a{x) with p as 2M-th root of unity, that is, to compute 
a(/?*), for i S [0 : 2M—1]. Express iasi = 2m-f+v with / S [0 : p—l] and n € [0 : 2m—1]. 
Then 

a(p') = a(p2-/+v) = (3,6) 

because according to (3.4) 

mod 

'-V-' 


with ^ = 


We already computed the polynomials a. 
we define 

dy{x) := dy{x ■ p""), (3.7) 

so that if ai,(x) is evaluated at x = 

Computing a„(x) can be done by computing its coefficients ay^£ = dy^i ■ p^^, with ^ E [0 : 
^—1]. Since coefficients are themselves polynomials, use Kronecker-Schonhage substitution 
as described in Section 3.2.6 to efficiently multiply them. 

hi. Perform outer DFTs. 

Now all that is left is to evaluate the ay{x), n E [0 : 2m — 1], at x = p^^'^, for / E [0:^ — 1]. 
In Step ii we arranged a^(x) in such a way that this evaluation is nothing but a length-/x 
DFT of a„(x) with as root of unity. Call these the outer DFTs. They are depicted in 
Figure 18. 


{p‘^^-f+vY _ 



o.t 


(x) in Step i above. In order to efficiently compute 


^ Since a — b (mod c) means a — b = kc, for some k G Z, a — b (mod 0) means equality. 













42 


Chapter 3. The DKSS Algorithm 


a-o — 

ai = 

02m-l = 


[ Q0,0 


00,1 




0-1,0 Ol,l 


Ol,/J.-l 


02m—1,0 02m—l,l ■ ■ ■ 02m—l,fi—l 


Figure 18: Outer DFTs on 2m row vectors of g elements 

If M > m this is done by a recursive call to the FFT routine and according to (3.6) and 
(3.7) computes ay{p^^'^) = a„(/ 0 ^™'''^+'") = a(/3^”*'-^+'") = a(p*). 

If M < m, just computing an inner DFT with as (2m/2M)-th root of unity is 

sufficient. 


3.2.6 Componentwise Multiplication 

Multiply coefficients Oj by hi to compute 2M product coefficients Cj := Oj5j. Since coefficients 
are from Tt and are thus themselves polynomials, we use Kronecker-Schonhage substitution (cf. 
[Sch82, sec. 2], [BZll, sec. 1.3 & 1.9]) to multiply them and reduce polynomial multiplication 
to integer multiplication. Then we can use the DKSS algorithm recursively. 

Definition (Kronecker-Schonhage substitution). Kronecker-Schonhage substitution reduces 
polynomial multiplication to integer multiplication. Since IZ = V[a\/+ 1) consists of polyno¬ 
mials with degree-bound m, whose coefficients are in V = Z/p^Z, each coefficient can be stored 
in d := [logp^] hits. Coefficients are to be multiplied, so 2d bits per coefficient product must he 
allocated to prevent overflow. Furthermore, multiplication of two polynomials with degree-hound 
m leads to m summands for the middle coefficients, thus another logm bits per coefficient are 
required. 

This substitution converts elements of TZ into integers that are m{2d logm) bits long. Then 
these integers are multiplied and from the result the product polynomial is recovered. 


3.2.7 Backwards FFT 

The backwards FFT works exactly like the forward FFT described in Step 5. We use in fact an 
inverse FFT and reordering and scaling of the resulting coefficients is handled in the next step. 


3.2.8 Carry Propagation 

In Step 4, we encoded an input number a into the polynomial a{x) by putting Mm/2 bits into each 
outer coefficient and from there distributing u bits into each of the m/2 lower inner coefficients. 
When decoding the product polynomial c(x) into the number c, we must use the same weight 
as for encoding, so we evaluate the inner coefficients at a = 2“ and the outer coefficients at 
X = Of course, on a binary computer this evaluation can be done by bit-shifting and 

addition. 

We must take the ordering of the resulting coefficients into account. In Section 2.7 we defined 
a backwards transform to get results that are properly ordered. However, for simplicity of 
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implementation, we use again a forward transform and access its resulting coefficients in different 
order. 

Furthermore, all result coefficients are scaled by a factor of 2M, so we have to divide them by 
2M prior to addition. 


3.3 Run-time Analysis 

3.3.1 Analysis of each Step 

Our goal is to find an upper bound to the bit complexity T{N) needed to multiply two non¬ 
negative A-bit integers using the implementation of DKSS multiplication to get their 2A-bit 
product. We estimate the run-time of each step individually. 


1. Choosing M and m does only depend on the length of the input and can be done in 
constant time. 

2. Computing u takes constant time as does finding p, since we precomputed all values for p 
for the supported hardware. Thus, this step has cost 0(1) as well. 

3. In this step we compute a 2M-th root of unity p &TZ from a known generator of F*. T-p 
denotes the time to multiply two arbitrary numbers in V. First, we use Hensel lifting to 
calculate Q m z — 1 lifting steps. In each step we have to calculate 

C.+1 :=6-(cr'-i)-((p-i)cr^)-'. 

This can be done with 1 exponentiation, 3 multiplications, 4 subtractions and 1 modular 
inverse. 

To exponentiate, we use binary exponentiation [Knu97b, ch. 4.6.3], which requires O(logp) 
multiplications in V, and to find the modular inverse we use the extended Euclidean 
algorithm [Knu97b, ch. 4.5.2] with O(logp^) steps, where each step costs O(logp^). After 
lifting, we calculate a; = where h < p/2M. 

Together, the cost to calculate u is 

T^ = {z- l){0{\ogp)Tv + + 40{logp^) + 0{logp'')0{logp^)) + 

O {log p)Tp 

= 0{logp ■ Tp -F log^p^). 

After that, we perform Lagrange interpolation: according to (3.2) it consists of m additions 
of polynomials in IZ, each of which is computed by m — 1 multiplications of degree-I 
polynomials with polynomials in IZ plus m — 1 modular inverses in V. 

Thus, the run-time for Lagrange interpolation is 

Tlagrange = m(mO(logp^) + {m - l){2mTp + 0(logp'^ • logp^))) 

= 0(m^(logp^ + mTp + log^p^)) 

= 0{m^{mTp -|- log^p^)). 
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Ordinary multiplication can multiply n-bit integers in run-time hence Tp can be 

bounded by O(log^p^). Using (3.1) we estimate = 0{N^^/ N) and recall that 

m = 0{logN). We get as total time to compute p: 

Tp — T(j -|- Ti^agrange 

= 0{logp ■ T-p + log^p^)) -|- 0{w?{mTp -|- log^p^)) 

= 0(logp • O(log^p^) -|- log^p^ -|- m^(mO(log^p^) -|- log^p^)) 

= 0( logp • log^p^ -|- m^(mlog^p^ -|- log^p^)) 

= 0((logp -|- m?) log^p^) 

= 0{{\og{N^/ \og^^ A^) + log3 N) log2(iV-^7log2'^^ N)) 

= 0{\og^N Aog^ N) 

= ©(log*^ A^). 

4. Encoding input numbers as polynomials can be done in time proportional to the length 
of the numbers, that is, in time 0{N). 

5. As we will see, the FFT is one of the two most time-consuming steps; the other one being 
the multiplication of sample values. Let us first evaluate the run-time of a length-2M 
FFT over IZ, denoted by T£){2M). We analyze the run-time of each step of the FFT 
individually. Tp denotes the time needed to multiply two arbitrary elements of IZ and 
will be specified later. 

i. The first step performs p = 2M/2m inner FFTs over TZ of length 2m. To calculate 
one DFT we need to perform 2mlog(2m) additions and mlog(2m) multiplications 
by powers of a, cf. (2.4). A single addition costs 0(mlogp^), since an element of 
7^ is a polynomial over V = TLjp^TL with degree-bound m. Since multiplication by a 
power of a can be done with a cyclic shift, its run-time is of the same order as that 
of an addition. So the run-time to compute one inner DFT is 

3mlog(2m) • 0(m log7) = 0{vr? logm • logp^), 
and the run-time to compute all 2M/2m inner DFTs in this step is 
2M/2m ■ 0{m^logm ■ log7) = 0{Mm\ogm ■ log7). 

ii. Here, we prepare the 2m polynomials ay{x) for the outer DFTs. For each u S [0 : 
2m — 1], the polynomial ay{x) has p = 2M/2m coefficients, which makes a total of 
2M multiplications in IZ by powers of p to compute all ay{x). The same number 
of multiplications is needed to compute the powers of p. So this step has a total 
run-time of 4M • Tp. 

hi. This last step computes 2m outer DFTs. The FFT routine is invoked recursively to 
do this. The total run-time for this step is the time for 2m DFTs of length 2M/2m 
and hence 

2m ■ T]o{2M/2m). 

The recursion stops when the FFT length is < 2m, that is, after log2m{2M) levels. 
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The total run-time Td{2M) of the FFT is the sum of the run-times of all three steps, that 
is, 

Td(2M) = 0{Mm\ogm ■ \ogp^) -|- AM ■ + 2m ■ T£){2M/2m) 

= log2m(2M) • (0(Mm log m • logp^) -|- AM ■ T-ji). (3.8) 


6. Each of the 2M coefficient pairs a*, 6* can be multiplied in time T-j^. Thus, the run-time 
for this step is 2M ■ T-ji. 

7. The backwards FFT has the same cost as the forward FFT, see (3.8). 

8. Decoding the polynomials back into integers and performing carry propagation can be 
done with 2Mm additions of length logp^, hence with cost 


Tdecode 


0{2Mm log p^) 


0 { 


o 


N 


■ log N ■ log 


log^ N 
^ \ogN^^ 


log A/" 
0{N). 




3.3.2 Putting Everything Together 

To conclude our evaluation of the run-time, we need to upper bound the value of Tt^, the time 
needed to multiply two arbitrary elements of TZ. For that purpose, we use Kronecker-Schonhage 
substitution as described in Section 3.2.6. 

Theorem (Kronecker-Schonhage substitution). Multiplication in IZ can he reduced to integer 
multiplication of length 0(log^ A^) bits. 

This substitution converts elements of 7Z into integers of m{2d + logm) bits, with d = [logp^], 
multiplies the integers and from the integer result recovers the product polynomial. To see how 
large these integers get in terms of N, we use (3.1) and obtain 

m{2d -|- logm) = 0(log A^ • (2|'logp^] -|- log log A^)) 

= 0(log N ■ {logp^ + log log N)) 

= 0(log N ■ {log{N^^/ log^^^ A^) -|- log log A^)) 

= 0{logN ■ LzlogN) 

= 0{log^N). (3.9) 


T{N) denotes the time to multiply two A^-bit integers, so T-ji = r(0(log^ N)). 


□ 
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Adding up the run-time estimates of all steps we get the total run-time T{N) for DKSS multi¬ 
plication: 

T{N) = 0(1) + 0(1) + 0(log'^ N) + 0{N) + 2Td{2M) + 2MTti + Td{2M) + 0{N) 

= 3Td{2M) + 2MTti + 0{N) 

= 3( log2^(2M)(0(Mm log m • logp^) + AMTti)) + 2MTti + 0{N) 

= 0(log2m(2M)(Mmlogm • logp^ -|- MT-ji) + MT-ji + N) 

= 0(Mlog2^(2M)(mlogm • logp^ + r(0(log2 N))) + Mr(0(log2 N)) + N). 


In terms of N that is 


T{N) = 0( 


N log(2iV/log2 AT) 
log^iV log(21ogA^) 


(log N ■ log log N ■ log 


( 




+ 


r(c>(iog2 JV))) + -^r(o(iog2 n)) + n) 

" °( toTiv N-\ogN + T(0(log" AT))) + 

^r(0(iog^«)) + b 

= o{N>ogN + __J^T(0(log^ IV)) + i^r(0(log^ IV)) + IV) 


= o(A^logA^ + 


N 

log N ■ log log N 


r(o(iog2iv))). 


(3.10) 


3.3.3 Resolving the Recursion 


To solve the recursion, we will need the following estimation. Observe that for any real a: > 4 
it holds that 


log(Alog^x) log(A(logx)^) 
log log X log log X 


log A -|- 2 log log X 
log log X 


< log A -|- 2. 


(3.11) 


The following notation is introduced to abbreviate the upcoming nested logarithms: define 
fo{N) := N and fi{N) := A log^/i_i(A^), for z E N and some A. Furthermore, let r > 4 be 
the smallest length where the algorithm is used, otherwise a simpler algorithm is used. Now we 
express the run-time from (3.10) with explicit constants, assuming that Alog^ = fi{N) > r 
and unroll the recursion once: 


T{N) < fi{NlogN + 


N 

log N ■ log log N 


T(Alog2 N)) 


= fj,N log N (^1 + 
< iJ,N log N (^1 -|- 
= fiNlogN^l + 


TjXlog^N) X 
log^ N ■ log log N ' 

log^ N) log(A log^ N) _ r(Alog^(Alog^N)) 

log^ N ■ log log N log^(A log^ N) ■ log log(A log^ A^) 

^Alog(Alog^ A^) TjXlog'^ fi{N)) \ 

log log N log^ fi {N) • log log /i (N) ^ 
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Using (3.11) leads to 


T{N) < ^^N\ogN{l +nX{\ogX + 2){l + 


-■V 


r(AiogVi(iv)) 

\og^ fi{N) ■ loglog/i(iV) 


fjiN log n(^ 1 +r] + r] ■ 


T(AlogVi(iV)) 
logVi(^) • loglog/i(iV) 


Assuming Alog^ = f 2 {Xh) > r we unroll once more and get 


T{N) < fiN log A^(|l + 7/ + 

_ /7AlogVi(jV) • log(AlogVi(jV)) TjXlog^ f 2 {N)) 

^ log^ fi {N) ■ log log fi {N) log^ f 2 {N) ■ log log /2 (N) 

Again canceling out and using (3.11) gives 


T{N) < iJ,N log N(^1 + rj + r] /iA(log A + 2)(1 + 


=v 


r(AiogV2(iv)) 

logV2(A^)log log/2(A^) 


AM ^ 2^ 2 r(AlogV2(iV)) A 

n ogNil rj 7] fj ' 2 f 1 f 


z 

giN log A^ f + 7/ 


i=0 


log / 2 (A 7 ) loglog/2(A^)- 
2 T(AlogV2(iV)) 
logV2(A^)log log/2(A^) 


Obviously, after unrolling j E Nq levels of recursion and assuming fj{N) > r we get 

T{N) < y 7 iV log Ar( ^ 7 ?* + • nX log^ f,{N)) x _ 

V ^ log2 fj (AT) log log fjiN)^ 


i=0 


(3.12) 


The remaining question is now: how many levels of recursion are there for a given N? To find 
out, we look for a lower bound for N after j levels of recursion. 

Equation (3.12) applies if fj{N) > r. If j > 1 we can reduce fj{N) once and get 

fjiN)>T 

AlogVi-i(A^) > r 
log^ fj-i{N) > r/A 

^ogfj-i{N) > -^tTA 

fj-i{N) > Txf^. (3.13) 

A second reduction works quite like the first, assuming j > 2: 

AlogVi-2(iV) >2^^ 

log/,_2(A^) > \[W^X 
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fj.2{N) > 

Transforming the exponent we get 

\J2V^/X = ^(2\/iA)V^/A = 7(2ViA)x/^/VA = j^fX. 

=:/3 

Now use that and reduce again, assuming j > 3: 

/,_2(Af) > 

AlogVi-3(iV) 

/,_3(iV) > 1'P^\ 

Transforming the exponent again gives 

^2/J'^/Va/a = ^(2^)/3^/A = \l{2V^f^ IVX = (2^)^^'^/v/A = /'^/VA, 

=/3 


which yields 


/,_3(A^) > 2^'’'^/^. (3.14) 

So we see that with each unroll step of fj{N) we get another exponentiation by f3 in the 
exponent. 

Definition (Iterated Exponentials). Let a, x G M, n E No and denote exp„(x) = , then 

{ X if n = 0 

, n-l/ NX V n 

exp„(exp^ (x)) ifn>0 

is called iterated exponentials or power tower. For example, exp^(x) = . This notation is 

inspired by Euler’s exp(x) function and functional iteration in [CLRS09, p. 58]. 

With the help of iterated exponentials we can write (3.14) as 

fj-siN) > 2'^-P|(^^)/^, 


and if we reduce fj{N) fully we get 


N = fo{N) > 2""^^'^ 


(3.15) 


We are close to the goal, which we can attain with help of the following 


Definition (Iterated Logarithm [CLRS09, p. 58]). Let a, x G M>o, 
is defined as 


loga(a;) 


0 */x < 1 

loga(logaX) + 1 Z/X>1 


then the iterated logarithm 


(3.16) 


and is the inverse o/exp”(l), that is, log*(exp”(l)) = n. The iterated logarithm is the number 
of logg^-operations needed to bring its argument to a value < 1. As usual, log* x := log^x. 
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Now we use the iterated logarithm on (3.15) and get 

N > 

logiV > exp^“^(v^)/\/A 
'/XlogN > exp^“^(-v/T) 
log^( VA log A^) > j - 1 + log^ 

log^(\/AlogiV) + 1 - log^ Vr > j. (3.17) 

We can replace log^ x by 0(log2x). To see why, observe that if /3 could be expressed as some 
power tower of 2, say, /3 = 2^^, that is, log* /3 = 3, then a power tower of (3 is less than one of 2 
with thrice the length, because = (2^^)^ < 2^^^ '. Hence, log^ x < log* x ■ log* (3 = 0(log* x), 
since (3 is constant. 

Since only is a variable, this finally leads to the estimate 

j < log^(\/Alog A^) + 1 - log^ ^/T 
= 0(log^(\/AlogW)) 

= 0{log}N) 

= 0{log*N). (3.18) 


Now, we can pick up (3.12) again. We assume rj ^ 1, /j(W) > t, but fj^i{N) < r and hence in 
analogy to (3.13), fj{N) < 2V^^. Then we get 


T{N) < nN log Nri^ + 7f 
< fiN log N 


i=0 
Tji+i _ 1 

r] — 1 


+ rj- 


r(AiogV,(iv)) 

logVi(A^)loglog/j(W) 

Tifnim 


logWi(A^) log log fj{N) 


Capturing the constants into Big-O’s yields 


r(A^) = fiN log N{0{r]^+^) + 0{rf)) 
T{N) =0{NlogN 

= A^logAr-7/^(i°g‘^). 


Expressing g and the constant from 0(...) as 2^^, for some constant k, we write 
T{N) = N log N- ( 2 ^)Oiiog*N) 

= A^-logAi-2‘^('°§*^l. (3.19) 


3.4 Differences to DKSS Paper 

The intention of this thesis is to assess the speed of an implementation of DKSS multiplication 
on a modern computer. Its architecture imposes certain limits on its software. For example, 
the amount of memory that can be addressed is limited by the size of the processor’s index 
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registers. A more compelling limit is that the universe contains only a finite amount of matter 
and energy, as far as we know. A computer will need at least one electron per bit and thus, 
even if we could harness all (dark) matter and energy for memory bits, any storable number 
could surely not exceed 2^*^^ bits in length. 

Another limit creeps in with the speed of the machine: there is no practical use to provide a 
solution that will run several thousand years or more to complete. An estimation of the run-time 
to multiply numbers with 2®® bits leads to a minimum of 7000 years on the test machine. 

This led me to assume a maximum length of input numbers. Since the implementation runs 
on a 64-bit CPU, the number’s length is de facto limited to 8 • 2®^/4 = 2®® bits. And since the 
length is limited, we can precompute some constants needed in the algorithm, namely the prime 
p and a generator of F*. I did this for values of p with 2 to 1704 bits in length. 

De, Kurur, Saha and Saptharishi went to great lengths to show that suitable primes p can be 
found at run-time and to make their construction work, they use p^ as modulus, z > 1, as we 
have seen in Sections 3.2.2 and 3.2.3. 

Furthermore, they encode input numbers as fc-variate polynomials, where the degree in each 
variable is < 2M. That is, outer polynomials are in 1Z\X\,... ,Xk]. When it comes to the 
FFT, they fix one variable, say X^, and treat the outer polynomials as univariate polynomials 
over S := 77[Ai,..., Afc_i]. Note that p is a principal 2M-th root of unity in S as well. 
Then they perform FFT multiplication of a univariate polynomial over S. The componentwise 
multiplication uses FFT multiplication recursively, because now two {k — l)-variate polynomials 
have to be multiplied. 

Since the only need for fc-variate polynomials was to show that p can be found at run-time, I 
was able to use k = \ and use univariate polynomials in the implementation. Furthermore, it 
was easy to precompute p to greater sizes, so there was no need for z > 1 and thus I dropped 
Hensel lifting to find as well. 


I changed some variable names from [DKSS 13] to avoid confusion with other variables of the 
same name or to improve clarity. If the reader is familiar with the original paper, here is a small 
overview of changed names: 


Description 

DKSS paper 

This thesis 

Exponent of prime p in modulus 

c 

z 

Number of variables for outer polynomials 

k 

(dropped, k = 1) 

Factor in progression for finding prime p 

i 

h 

Residue polynomials in DFT 

Uj 

dy 

Index variable in DFT 

k 

f 

Radix of FFT 

2M/2m 








Chapter 4 


Implementation of DKSS 
Multiplication 


In this chapter my implementation of DKSS multiplication is presented. Parameter selection is 
discussed and exemplary source code is shown, together with a description of tests performed 
to assert the software’s correctness. Then, measured execution time, memory requirements and 
source code size is examined. I discuss the results of profiling and lastly, extrapolate run-time 
for increasing input lengths. 


4.1 Parameter Selection 


The description of parameter selection in Section 3.2 leaves some freedom on how exactly to 
calculate M, m, u and p. Recall that we are performing FFTs of polynomials with degree- 
bound 2M in TZ[x\, where TZ = 'P[a]/{a'^ + 1) and V = Z/p^Z. We call coefficients in TZ[x\ 
outer coefficients and coefficients in V[a] inner coefficients. Both input numbers have N bits, 
parameter u is the number of bits of the input number that go into each inner coefficient and z 
is constant. 

I aimed at a monotonically increasing graph of execution time, that is, growing input lengths 
lead to growing execution times. Parameter selection that leads to a rough graph suggests that 
better parameters could be selected. 

This led me to choose the prime p first. Section 3.2.2 mentions lower bounds for p^. Recall that 
M Ki N/ log^ N and m « log N. I use 

> ^Mm22“ Ri liY^/logA^. (4.1) 

Furthermore, I decided to round up the number of bits of p to the next multiple of the word 
size. Since both allocated memory and cost of division (for modular reductions) depend on the 
number of words, it seemed prudent to make the most out of it. Benchmarks show that this 
was a good choice, see Figure 19 for a graph of timings. 

DKSS multiplication uses V = Z/p^Z with z > 1 to lower run-time in the asymptotic case 
by lowering the upper bound for finding the prime p. But that doesn’t apply here, since the 
machine this implementation runs on enforces upper limits of the length of numbers. So despite 
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the description of the process of Hensel lifting in Section 3.2.3, I did not implement it, because 
precomputation of larger prime numbers was the easier choice (see Linnik’s Theorem on page 36). 
Furthermore, the special build of Froth prime numbers could be exploited in the future to speed 
up modular reductions. 

Having chosen p, I then select the largest u that is able to hold the whole 2N bits of the result. 
It follows from (4.1) that log(p^) > log(Mm) + 2u — 1. Since log(p^) is chosen first, I try to 
maximize u. The larger u is, the less coefficients are needed. After finding an u that fits, I try 
to minimize the product Mm, because the smaller Mm is, the smaller the FFT length and the 
memory requirements are. 

Lastly, I set M and m and try to maintain the quotient M/m ~ N/log^ N that follows from 
the description in Section 3.2.1. On the other hand, factors can be moved around between M 
and m, since in selection of u and p only the product Mm is needed. I did some short tests on 
selecting M/m ~ k ■ N/log^ N for some k, but it seemed that k = 1 was overall a good choice. 


4.2 A Look at the Code 


If the parameters are given (namely M, m, u, and p), the main idea of DKSS multiplica¬ 
tion lies in the structure of the ring TZ and the way the DFT is computed: inner DFTs, bad 
multiplications and outer DFTs. 

To give an impression of the implementation, following is the FFT main routine. Language 
constructs (like templates and typedefs), debug code and assertions were stripped to improve 
readability. As mentioned in Section 2.2, tape_alloc is a stack-like memory allocator. It takes 
the number of words requested as argument. 


void dkss_fft ( 


word* a. 


// 

unsigned 

M , 


unsigned 

m , 


unsigned 

oclen , 

// 

unsigned 

iclen , 

// 

word* pz, 


// 

word* rho 

_pow , 

// 

unsigned 

base_pow) 

// 


{ 


input vector 

outer coeff length = m * 
inner coeff length >= bi 
modulus p ''z 

powers [0 : m-1] of \rho 
quotient of order of top 


iclen 

t_l eng th (p ^z) / hits (word) 

level \rho and now 


if (M <= m) { // use inner DFT 

tape_alloc tmp(oclen); 
word * t = tmp.p; 
unsigned log_M = int_log2(M); 
fft_shuffle (a , log_M + 1, oclen); 
dkss_inner_fft_eval(a, M, m, oclen, 
return ; 


right away 

// allocate some memory from the "tape 
// length oclen 

// pre-shuffle the values 
iclen , pz , t) ; 


} 


unsigned mu = M / m; // \mu = 2M/2m 

unsigned log2m = int_log2(m) + 1; 

tape_alloc tmp(2*M * oclen + 2*m * oclen + oclen); 

word* abar = tmp.p; // length 2*M*oclen, \bar{a} 

word* el = abar + 2*M * oclen; // length 2*m*oclen, e_\ell 

word* t = el + 2*m*oclen; // length oclen, temp storage 

// perform inner DFTs 

// i guess it ^s better to copy elements instead of using pointers and work 
// in-place , because this way cache thrashing can only occur once when 
// copying and not on every access. 
for (unsigned 1=0; l<mu; ++1) { 

// assemble e_l(y): 


// cycle through all values for I 
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// the j-th coeff of e_l(y) is the (j*mu+l)-th coeff of a(x) 

// for the FFT evaluation, we assemble e_l(y) already in shuffled order 
word* a_jxl = a + l*oclen; // points to a_l 

for (unsigned j=0; j<2*m; ++j) { 

word* el_j = el + bit_rev(j, log2m) * oclen; 
copy(el_j, a_jxl, oclen); 

a_jxl += mu * oclen; // point to next a_{j*\mu+l} 

} 


// perform inner DFT on e_l(y) with alpha as 2m-th root of unity 
dkss_inner_fft_eval(el, m, m, oclen, iclen , pz , t); 

// l-th coeffs of all a_v(x) is e_l (alpha''v), i.e. v-th coeff of DFT(e_l) 
// this copies transformed elements back into place 
word* el_v = el; 

word* abar_vl = abar + l*oclen; 
for (unsigned v=0; v<2*m; ++v) { 
copy(abar_vl, el_v, oclen); 
el_v += oclen; 
abar_vl += mu * oclen; 

} 

} 

// perform bad muls and outer DFTs 
word* abar_v = abar; 

word* rho_vl = t; // just for the name 

const index top_mu = mu * base_pow ; // top level mu 

unsigned psh = int_log2(top_mu); // shift count 

// cycle through all a_v to perform bad muls and outer DFTs 
for (unsigned v=0; v<2*m; ++v) { 

// skip first loop iteration: v == 0, i.e. abar_{v,l} *= rho''0 = 1 
word* abar_vl = abar_v; 
unsigned vlbase = 0; 

for (unsigned 1=1; l<mu; ++1) { // cycle thru all values for I 

vlbase += v * base_pow; 
abar_vl += oclen; 

unsigned pi = vlbase & ((1 << psh) - 1); // vlbase % top_mu 

unsigned pe = vlbase >> psh; // vlbase / top_mu 

// select right rho_pow and do cyclic shift 

modpoly_mul_xpow_mod_mpl(rho_vl, rho_pow + pi*oclen, pe, m, iclen, pz); 
// abar_ {v, 1} *= rho ''{vl} 

modpoly_mul_mod_mp1(abar_vl , abar_vl , rho_vl , m , iclen , pz ) ; 

} 


// now abar_v contains \tilde{a}_v. ready to do outer DFT: recursive call 
dkss_fft(abar_v, mu/2, m, oclen, iclen, pz , rho_pow , base_pow * 2*m) ; 

// copy back to ’a’ array 
word* a_fxv = a + v * oclen; 
word* abar_vf = abar_v; 
for (unsigned f=0; f<mu; ++f) { 
copy(a_fxv, abar_vf , oclen); 
abar_vf += oclen; 
a_fxv += 2*m * oclen; 

} 


} 


} 


abar_v 


mu * oclen; 


The listing shows one of the few optimizations I was able to implement: in the run-time anal¬ 
ysis in Section 3.3, Step 5.ii we counted 2M multiplications by powers of p and another 2M 
multiplications to compute those powers. I was able to reduce the number of multiplications 
for the latter from 2M io p = 2M/2m. 

I used the fact that p 2 M/ 2 m _ p^i _ if i E [0 : 2M — 1], set r := Yi/p\ and s '.= i mod p and 
thus i = rp + s. 
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Therefore it holds that p* = We can obtain /?* with an additional cyclic 

shift by precomputing all s E [0 : ^ — 1]. In benchmarks, this almost halved the run-time. 

In the above listing function dkss_inner_fft_eval() is called. This function doesn’t dif¬ 
fer much from the QMUL FFT evaluate function qmul_evaluate() on page 20, except that 
this time functions instead of operators are used to add and subtract elements, and multipli¬ 
cations by powers of the root of unity are done by cyclic shifts. Following is the listing of 
dkss_inner_fft_eval0: 


void dkss_inner_fft_eval ( 
word* e, 

unsigned n_half , 
unsigned m, 
unsigned oclen, 
unsigned iclen , 
word* pz, 
word* t) 


// input vector 
// half of FFT length 

// outer coeff length = m * iclen 

// inner coeff length >= bit_length(pz) / bits(word) 
// p 

// temp storage 


if (n_half == 1) { 

// lowest layer: butterfly of two outer coeffs, 

// i.e. add and sub of two inner polynomials 

word* e2 = e + oclen; // second inner polynomial 

copy(t, e2, oclen); 

modpoly_sub(e2, e, t, m, iclen, pz); // e2 = e - t 
modpoly_add(e, e, t, m, iclen, pz); // e = e + t 

return ; 


} 


dkss_inner_fft_eval(e, n_half/2, m, oclen, iclen, pz , t); 

dkss_inner_fft_eval(e + n_half*oclen, n_half/2, m, oclen, iclen, pz , t); 


unsigned inc = m / 

n_half; 


// 

increment for 

each loop 

word* el = e; 



// 

first inner p 

olynomial 

word* e2 = e + n_half*oclen; 


// 

second inner 

p 0 lynomia 1 

unsigned pow = 0; 
for (unsigned i=0; 

i<n_half; ++i) 

{ 




// w = omega_n''i 

, t = w*e2 





modpoly_mul_xpow 

_mod_mpl(t, e2, 

pow , 

m, iclen, pz); // 

cyclic shift 

modpoly_sub (e2 , 

el, t, m, iclen 

. pz): 

; // 

e2 = el - t 


modpoly_add(el , 

el, t, m, iclen 

, pz) : 

; // 

el = el + t 



el += oclen; 
e2 += oclen; 
pow += inc ; 

} 

} 


4.3 Asserting the Code’s Correctness 


Development included writing a lot of test code. Every major function has some unit tests 
following it. The unit tests usually contain fixed data to be processed by the function to be tested 
and compare its output to results that are known to be correct, since they were computed by 
other means: Python programs were used to compute the correct results for FFTs in polynomial 
quotient rings, a different method for multiplication was used to test DKSS multiplication, and 
sometimes the correct results were more or less obvious and could be hard-coded by hand. 

Additionally, functions contain assertions (like C++’s assertO), which are assumptions that 
are written together with the (proper) code and are checked at run-time. Often, these are pre- 
and post-conditions of functions. Some asserts call functions that were solely written for use 
in assertions, like a test for primitivity of a root. 
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To have the best of both worlds, code can be compiled in Debug or Release mode with Visual 
Studio. Release builds have all asserts disabled and are compiled with optimizations for maxi¬ 
mum speed, while Debug builds feature assertion checking, but code optimization is disabled to 
aid debugging. Test code is usually run in Debug mode, while benchmarks are run in Release 
mode. 


Furthermore, after development of DKSS multiplication was completed, it was integrated into 
my framework of long integer routines that is maintained as a private project. This framework 
is used for primality testing of Mersenne numbers (numbers of the form 2^ — 1). Of course, it 
can not compare to the Great Internet Mersenne Prime Seareh [GIMPS], the distributed effort 
to find new Mersenne prime numbers that is going on since 1996 and has found the last eleven 
record prime numbers. 

Nevertheless, I have been checking Mersenne numbers for primality for over two years now and 
a database exists of the low 64 bits of the result (called the residue) for each Mersenne number. 
The primality test used for Mersenne numbers is the Lueas-Lehmer test [CP05, ch. 4.2.1]. It 
consists of a loop of a long integer square, a subtraction by 2 and a modular reduction. The 
nature of this test causes even single-bit errors to proliferate, so any error would most likely 
alter the residue as well. Since it is hard to test all code paths with unit tests this makes it a 
good way to test a multiplication routine. 

As a system test DKSS multiplication was used in Mersenne number primality tests and its 
results were compared against existing results. The first 35 Mersenne primes (the largest being 
21398 269 _ -were correctly identified as such. Furthermore, all Mersenne numbers 2^ — 1 with 
p < 120 607 and various other sizes were tested and the residues matched. 

4.4 Execution Time 


Our main interest is to find out how fast DKSS multiplication is in comparison to other, well 
established algorithms. Except for small and medium lengths, Schonhage-Strassen multiplica¬ 
tion was the fastest algorithm that used all-integer methods in practice so far. I compare both 
implementations DKSS_MUL and SMUL to one another. 

Figure 19 shows graphs of DKSS_MUL and SMUL execution time (and Figure 24 shows some of the 
raw data). The cyan-colored and the magenta-colored graph show execution time if p was not 
rounded up to the next multiple of the word size, and if p was in fact rounded up, respectively 
(cf. Section 4.1). It is always faster to use a rounded up p than to use the “original” value. 

As can be seen clearly, DKSS_MUL is much slower (about 30 times) than SMUL (printed in green) 
over the whole range of tested input lengths. From this graph it is hard to see if DKSS_MUL is 
gaining on SMUL. Section 4.8 discusses the quotient of run-times and the location of a crossover 
point in detail. 

The stair-like graph stems from the fact that execution time almost totally depends on the FFT 
length 2M and the size of elements of 77 = 7^/(a”* -|- 1) with V = 'L/p^'L. Since both M and m 
are powers of 2, many different input lengths lead to the same set of parameters. 

The graph shows that execution time is almost the same for the beginning and the end of each 
step of the stair. The only part that depends directly on N is the encoding of the input numbers 
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Figure 19: Execution time of DKSS MUL 


and decoding into the resulting product. But the time needed to do the FFT clearly dominates 
overall execution time. 

In contrast to DKSS_MUL, the SMUL execution time graph is much smoother. In fact, it is repro¬ 
duced without marks that would otherwise only obscure the graph, because there are a total of 
12 969 data points available, of which 465 representative points are shown. 

Obviously, DKSS_MUL parameter selection could be improved, since sometimes larger input num¬ 
bers lead to faster execution times. Either, more research on parameter selection or a calibration 
process should smooth this out. 


4.5 Memory Requirements 


DKSS_MUL memory requirements are dominated by three times the size of the polynomials: input 
a{x) and h{x) G IZ\x\ and the a„(x). The result c{x) requires no further memory, since storage 
of one of the input polynomials can be reused. An improved implementation could save the 
av{x) directly back into the polynomial without need for temporary storage, thus saving one 
third of memory requirements. To accomplish that a fast matrix transposition is needed, which 
in itself is not trivial (cf. [Knu97a, exercise 1.3.3-12]). 

The polynomials each have 2M coefficients in 77 = V[oi\/{a^ -|- 1), where V = Z/p^Z. Hence, 
each polynomial needs 2Mm|'logp^] bits. With M ~ N/log^N, m ~ logAI and p^ ~ 
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Input length 
(words) 

DKSS_MUL memory 
(bytes) 

DKSS_MUL 

blow-up 

SMUL memory 
(bytes) 

SMUL 

blow-up 

Q 

3648 

803 584 

27.54 

251 848 

8.63 

3.19 

7168 

1623 040 

28.30 

501 704 

8.75 

3.24 

14 336 

3 228 672 

28.15 

962 728 

8.39 

3.35 

28160 

6 439 936 

28.59 

1855 528 

8.24 

3.47 

56 320 

12 862 464 

28.55 

3 693 672 

8.20 

3.48 

110 592 

25 707520 

29.06 

7 240136 

8.18 

3.55 

221184 

51422 464 

29.06 

14 331384 

8.10 

3.59 

434176 

102 819 072 

29.60 

28 372 232 

8.17 

3.62 

868 352 

205 612 288 

29.60 

56 716 552 

8.16 

3.63 

1703 936 

406 915 072 

29.85 

111269 224 

8.16 

3.66 

2 752 512 

616 798 080 

28.01 

178 538 880 

8.11 

3.45 

5 505 024 

1233 557376 

28.01 

361056 896 

8.20 

3.42 

10 878 976 

2 467113 216 

28.35 

705184 592 

8.10 

3.50 

21757 952 

4934174 976 

28.35 

1477143 728 

8.49 

3.34 

42 991616 

9 765 277440 

28.39 

2 819 507024 

8.20 

3.46 

85 983 232 

19 530 403 584 

28.39 

5 638 486 864 

8.20 

3.46 


Figure 20; Memory requirements of DKSS_MUL and SMUL 


^N^/logN (see (4.1)) that results in 

2Mm|'logp^] « 2N/ log^ N ■ log Af • log(-A^^/ log N) 

= 2N/\ogN ■ (—1 + 51og — log log A^) 

« lOAT. 

The listing of function dkss_fft() in Section 4.2 shows that more memory, namely another 
(2m + 1) • [logp^] lOlog^ N bits, is allocated, but compared to lOA^ bits for each polynomial 
that is of no big consequence. The same applies to the 2M/2m precomputed powers of p, each 
with a length of mflogp^] bits. Together, they only need 2M/2m ■ mflogp^] = Mflogp^] bits, 
that is, a 2m-th part of the memory of one polynomial. Hence, if both input numbers have N 
bits, total memory needed by DKSS_MUL is 

-AfDKSS_MUL(A^) ~ 30A^ bits. 

Let us now compare the memory requirements of DKSS_MUL to SMUL. According to (2.32), 
ALsmul(A^ 0 = bits. I wrote “Af'”, since in Chapter 2.9 “A"” describes the length of the 
produet, hence N' = 2N to adjust notation to this chapter. Ergo, the approximate amount of 
temporary memory for SMUL is Msmul(A) = 4A' = 8N bits. 

Figure 20 shows an overview of actual memory consumption for selected input sizes for both 
DKSS_MUL and SMUL. The lengths chosen are the most favorable lengths for DKSS_MUL. At those 
lengths, the coefficients of the polynomials in DKSS_MUL are fully filled with bits from input 
numbers a and b (as much as possible, as the upper half of each polynomial still has to be zero 
to leave room for the product). Increasing the lengths by one would lead to the least favorable 
lengths that need about double the memory for almost the same input length. 
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The column “DKSS_MUL blow-up” shows the quotient of DKSS_MUL memory requirements and 
the size of one input factor in bytes. The column “SMUL blow-up” shows the same quotient 
for SMUL. The column “Q” shows the quotient of DKSS_MUL and SMUL memory requirements. 
Column “DKSS_MUL blow-up” nicely fits the approximated memory of 30A^ as well as column 
“SMUL blow-up” supports the approximated memory requirements of 8N. 


4.6 Source Code Size 


Given the description of the DKSS algorithm in Chapter 3, the implementation is relatively 
straight-forward. About one third of the newly written code is needed for performing polynomial 
arithmetic: addition, subtraction, comparison, cyclic shifting and output and furthermore, us¬ 
ing Kronecker-Schonhage substitution, multiplication, squaring and exponentiation. The other 
two thirds are taken up by the core DKSS routines, code to compute the primes p and other 
supporting code. 

Underlying the DKSS code are routines that had to be written, but are not otherwise mentioned 
here, since they are not an immediate part of DKSS multiplication, like: factoring of long integers 
into primes and Lucas primality test [CP05, sec. 4.1] (for the computation of primes p for rings 
V), extended Euclidean algorithm (to compute modular inverses in Hensel lifting and Lagrange 
interpolation), a C++ class for long numbers (to handle non-time-critical calculations easily), a 
faster division with remainder (see [Knu97b, ch. 4.3.1, p. 272] and [BZll, ch. 1.4.1]). Other code 
that was used had already been written before: basic arithmetic, benchmarking code for speed 
tests, the Lucas-Lehmer test for primality for Mersenne numbers and a database of Mersenne 
number primality test results. 

To give an idea about the size of the source code of DKSS multiplication, the following table 
shows the counts of lines of code. The second column (“Total source lines”) contains the count 
including test and debug code, assertions, comments and empty lines, while the third column 
excludes those and only counts lines of code that actually do work in a production version 
(“Pure code lines”). The big difference in numbers is mostly because of test code. The above 
mentioned underlying routines are not included in the counts. 


Description 

Total source lines 

Pure code lines 

Polynomial arithmetic 

958 

295 

Core DKSS multiplication 

1374 

336 

Precomputation of primes p 

139 

86 

Other supporting code 

279 

157 

Total program code 

2750 

874 

Table of precomputed primes p 

1707 

1705 

Total 

4457 

2579 


“Table of precomputed primes p” contains an array of 1703 prime numbers of the form /i • 2” -t-1 
for each bit length from 2 to 1704, with the smallest odd h. Data from this array is needed for 
DKSS_MUL, but it doesn’t really qualify as code, because it’s only a list of constants. Since only 
values for p are used that are a multiple of 64 bits long and input numbers are limited by the 
64-bit address space of the CPU, a list with 6 values for p would have done as well. 
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Compare this to the line counts of the implementation of Schonhage-Strassen multiplication: 


Description 

Total source lines 

Pure code lines 

Core SMUL multiplication 

805 

323 

Fast cyclic shifts 

518 

253 

Other supporting code 

414 

237 

Total 

1737 

813 


The row “Fast cyclic shifts” shows a special feature of the SMUL implementation: I went to great 
lengths to write fast cyclic shift code that takes advantage of different shift counts (like word- 
or byte-aligned). The original function for cyclic shifts had only 4 lines! 


4.7 Profiling 


To get a feeling for which parts of DKSS_MUL use up the most computing time, I did some 
profiling of the code. Visual Studio’s built-in profiling did not perform very accurately and 
I had some technical difficulties. So instead I used a small self-made solution: I timed the 
execution of certain code parts manually. 

This is not a thorough investigation, but just serves to gain a better understanding where hot 
spots of execution lie. Thus, I have chosen just five different input lengths for measurement. 


In a first run, I measured the execution times for FFT setup (precomputation of p and its 
powers), the time needed for all three FFTs, pointwise multiplications and encode/decode/nor¬ 
malization of the result. 


Input length 
(words) 

FFT setup 

dkss_fft 0 

Pointwise 

multiplications 

En/decode & 

normalize 

3648 

18.00 % 

58.60 % 

16.55 % 

6.85 % 

28160 

2.26 % 

79.46 % 

12.88 % 

5.40 % 

221184 

0.66 % 

84.40 % 

10.53 % 

4.41 % 

10 878 976 

0.39 % 

88.56 % 

8.19 % 

2.86 % 

42 991616 

0.27 % 

87.71 % 

9.32 % 

2.70 % 


Figure 21: Profiling percentages for DKSS_MUL 


Figure 21 shows the results. I only present percentages of execution time. From this table 
several conclusions can be drawn: 

• Computation of p and its powers, something which has to be done before the FFT starts, 
takes a diminishing share of time as the input gets longer. When numbers are in the 
millions of words long, it doesn’t carry any significant weight in the overall run-time. This 
was to be expected. 

• The same holds in principle for encoding, decoding and normalizing of the polynomials. 
It’s more expensive than computing p and its powers, but with a decreasing share of the 
total cost. This too, was to be expected. 
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• Even the pointwise multiplications seem to be getting less prominent in the overall cost. 
Maybe this shows that parameters could be selected better? More research is needed here. 

• The one part which is taking a growing share of the total cost is the DKSS EFT itself. I 
cannot assess from this data whether the share will be ever growing or reaches a plateau. 
Still, most of the execution time is spent here, so this is why we look more closely into its 
run-time. 

In Figure 22 we see the percentages of execution time that are needed by the constituent parts of 
the DKSS EFT. It is performed by computing inner DFTs, bad multiplications and outer DFTs, 
which for their part are calculated by recursively calling the EFT routine and therefore again 
calculating inner DFTs and bad multiplications. The respective columns contain the execution 
time summed up over all levels of recursion. This table is normalized, so that total time of 
dkss_fft() is 100 %. 


Input length (words) 

Inner FFT 

Bad multiplications 

Rest 

3648 

22.24 % 

76.09 % 

1.67 % 

28160 

16.98 % 

80.90 % 

2.12 % 

221184 

16.20 % 

81.23 % 

2.57 % 

10 878 976 

10.21 % 

87.66 % 

2.13 % 

42 991616 

9.50 % 

89.36 % 

1.14 % 


Figure 22: Profiling percentages for dkss_fft() 


The column titled “Rest” contains some call overhead, the copying of into ei and the copy 
back of the coefficients into the a array. I suspected that cache thrashing would slow this 
process down a lot, but these results show that this is not the case. 

From this more specific analysis we learn that most of the time in dkss_fft() is used up by 
bad multiplications and their share is growing. That sure is a hot spot. So we will have a look 
into bad multiplications, which are multiplications of two arbitrary elements of TZ. 

Figure 23 shows a breakdown of execution time for multiplications of elements of TZ. Multiplica¬ 
tions are done by Kronecker-Schonhage substitution: encode polynomials as integers, multiply 
the integers, decode them back to polynomials, perform the “wrap around”, that is, the modulo 
(a™' -|- 1) operation, and perform the modulo p^ operation on the inner coefficients. Again, total 
time of bad multiplications was normalized to 100 %. 


Input length 
(words) 

m 

Words per 

inner coefficient 

Integer 

multiplication 

Modular 

reduction 

Rest 

3648 

16 

2 

47.33 % 

38.92 % 

13.75 % 

28160 

16 

2 

46.72 % 

39.69 % 

13.59 % 

221184 

16 

2 

46.57 % 

39.80 % 

13.63 % 

10 878 976 

16 

3 

57.37 % 

33.14 % 

9.49 % 

42 991616 

32 

3 

66.48 % 

26.67 % 

6.84 % 


Figure 23: Profiling percentages for bad multiplications 


Since Kronecker-Schonhage substitution depends on TZ, it only depends on parameters m and 
p^, but not M nor u. The first three rows have the same values for m and p^, so it fits the 
theory well that the percentages are more or less the same. 
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Time needed for modular reduction is not negligible and a better means than modulo division 
might save some time here {Fast mod operation for Froth moduli, [CP05, p. 457]). But the trend 
seems to be that for growing lengths the share of execution time needed for modular reductions 
is shrinking. 

In column “Rest” the times for encoding and decoding between polynomials and integers are 
lumped together. This seems to be quite slow and a more careful implementation could speed 
it up, but again, that percentage will only drop as input numbers get longer. 

From this profiling analysis we have learned that bad multiplications are really bad! Up to 
90 % of execution time is spent there and its share is growing. In order to reduce overall 
execution time, we should reduce the number of bad multiplications and/or make them cheaper. 
Maybe better parameter selection could reduce execution time here, which is left open for future 
research. 


4.8 Gazing into the Crystal Ball 


One goal of this thesis is to compare the speed of a DKSS_MUL implementation with an SMUL 
implementation. As was described in Section 4.4, SMUL is still much faster for the lengths tested. 

In addition, it would be interesting to estimate the input length where DKSS_MUL starts to be 
faster than SMUL. To do that, we look again at the most favorable lengths for DKSS_MUL, that is, 
the lower right points of the steps in Figure 19, where the execution time graph for DKSS_MUL 
is nearest to the SMUL graph. Figure 24 lists execution times at those points and the quotient 
of these times. Figure 25 shows a graph of the quotient of execution times vs. input length. 


Length 

(words) 

DKSS_MUL time 

(cycles) 

DKSS_MUL 

(min:sec) 

SMUL time 

(cycles) 

SMUL 

(min:sec) 

Quotient 

3648 

133 948102 

0:00.039 

4149 866 

0:00.001 

32.28 

7168 

288 821718 

0:00.085 

8 884 604 

0:00.003 

32.51 

14 336 

636 214 972 

0:00.187 

19131288 

0:00.006 

33.26 

28160 

1373 645 624 

0:00.404 

39 547108 

0:00.012 

34.73 

56 320 

2 908 271 180 

0:00.855 

81912 772 

0:00.024 

35.50 

110 592 

6 013189 608 

0:01.769 

179 448 020 

0:00.053 

33.51 

221184 

13 430 829 526 

0:03.950 

425 460 492 

0:00.125 

31.57 

434176 

29 461464 342 

0:08.665 

882 781300 

0:00.260 

33.37 

868 352 

62 917 787338 

0:18.505 

2167 722116 

0:00.638 

29.02 

1703 936 

122 680187946 

0:36.082 

4 576 352 552 

0:01.346 

26.81 

2 752 512 

199 424 397176 

0:58.654 

7495 493 476 

0:02.205 

26.61 

5 505 024 

410 390 455 672 

2:00.703 

15 269441152 

0:04.491 

26.88 

10 878 976 

892 949 727060 

4:22.632 

31013 681856 

0:09.122 

28.79 

21757 952 

1917 703 330120 

9:24.030 

65 485 660 216 

0:19.260 

29.28 

42 991616 

3 965 210 546 518 

19:26.238 

132 248 494 436 

0:38.897 

29.98 

85 983 232 

8145120 758 260 

39:55.624 

288 089 862 672 

1:24.732 

28.27 


Figure 24: Execution times of DKSS MUL and SMUL 
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At first sight, there is an apparent trend in the quotient of execution times. Looking at Figure 25 
we might, as a first approximation, assume a linear relationship between logj^Q N and the quotient 
of execution times. Linear regression with a least squares estimation leads to the line f{N) = 
— 1.54 • log^Q iV + 39.62, which has a correlation coefficient of —0.723. Solving f{N) = 1 leads 
to N ^ 10^® « 2^^ bits. 

On the other hand, analysis of SMUL execution times in Section 2.9.5 showed that SMUL reaches 
its “final” speed only above input lengths of about 512 Kwords (cf. Figure 13). So it seems 
that Figure 25 not so much shows the speed-up through improved speed of DKSS_MUL, but the 
slow-down of SMUL because of diminishing positive effects of caching. If we do linear regression 
with data points starting at input length 512 Kwords only, we get g{N) = 0.88 • logj^Q A^-|-22.11, 
that is, the quotient would be growing! Obviously, this type of analysis is not very reliable. 


As we did for SMUL in Section 2.9.5, we can use the measured data points to try to model the 
run-time of DKSS_MUL. Writing (3.19) with an explicit constant we get 

Ts{N) < N-log N^. (4.2) 

Calculating the constant 5 from each data point and plotting all of them gives the graph in 
Figure 26, with average S ~ 1.2727. In contrast to Figure 13, no effect of caching is apparent. 
We only have few data points, so this model of run-time is not very resilient. Yet, the modeled 
run-time matches the measured values ±10 % and even within ±5 % for input lengths > 28 160 
words. With the few data points we have, it seems to be the best we can do. 
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Taking (2.33) and (4.2) we can solve 


Ts{N) < T^{N) 

NlogN < aN log N -loglogN 

2-5 log* < a loglogN 
5 log* N < log cr + log log log N. 

For large N we substitute u := log log and with (3.16) get 

h(log* (log log A^) + 2) < log a + log log log N 

5(log* n + 2) < log cr + log u. (4.3) 

1 nl49 

Solving (4.3) numerically yields the enormous solution of v > 498 and hence N > bits! 

An optimistic estimation of the number of bits for computer memory available in this universe 
is So this crossover point is orders of orders of magnitude higher than any machine could 

hold that anyone could ever build. 

Even if DKSS_MUL was only about 2 times slower than SMUL, the crossover point would still be 
at ~ bits and thus unreachable. 









Chapter 5 


Conclusion 


De, Kurur, Saha and Saptharishi describe a new procedure to multiply very large integers 
efficiently (cf. Chapter 3, implemented as DKSS_MUL). The currently widely used all-integer mul¬ 
tiplication algorithm for large numbers is by Schonhage and Strassen [SS71] (my implementation 
is called SMUL, cf. Section 2.9). The run-time of DKSS_MUL is in a better complexity class than 
that of SMUL, meaning that if input numbers are long enough, DKSS_MUL will be faster than SMUL. 
Both algorithms were implemented and their run-time (Section 4.4) and memory consumption 
(Section 4.5) were compared (on a PC with 32 GB memory and a 3.4 GHz processor). 

The results indicate that Schonhage and Strassen’s multiplication algorithm is the better choice 
for a variety of reasons: 

1. SMUL is faster than DKSS_MUL. 

Benchmarks show that SMUL is still about 26 to 36 times faster than DKSS_MUL (Section 4.4 
and especially Figures 19 and 25). The estimate of the input length at which DKSS_MUL is 
faster than SMUL (Section 4.8) is bits (which is larger than googolplex), but 

even if SMUL was only 2 times faster than DKSS_MUL, the crossover point would be so large 
that it could never be reached. 

2. SMUL requires less memory than DKSS_MUL. 

If both input numbers are N bits long, DKSS_MUL requires about 30A^ bits of temporary 
memory, where SMUL requires only about 8N bits (Sections 4.5 and 2.9.4). The memory 
requirements of SMUL can not be lowered significantly, but there is an obvious possibility 
to lower DKSS_MUL memory consumption to its lower limit of about 20N bits that was not 
implemented. 

3. SMUL is easier to implement than DKSS_MUL. 

A simple implementation of SMUL needs about 550 lines of C++ code, where DKSS_MUL 
requires about 900 lines plus at least 6 lines of constants and more supporting routines, 
see Section 4.6. An improved and faster version of SMUL requires about 800 lines of code. 


It should be mentioned here that the SMUL implementation is better optimized than DKSS_MUL. 
The reason for that is that Schonhage-Strassen multiplication is now studied and in wide use 
for many years and its details are well understood. I have spent considerable time to improve 
its implementation. In contrast, DKSS multiplication is still quite young and to my knowledge 
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this is the first implementation of it. Section 5.1 describes several possible improvements to 
DKSS_MUL that could be realized. Still, in my appraisal none of them has the potential to speed 
up DKSS_MUL so much that it becomes faster than SMUL in the range of input lengths that was 
examined here or even in ranges that might be accessible in the future. 


5.1 Outlook 


In the course of writing, I encountered several possible areas for improvement. I list them here 
and try to assess their potential to improve run-time. 


• Find optimum values of parameters M, m, u and for any given N. 

Figure 19 still shows some areas where longer input numbers lead to shorter execution 
times. Furthermore, Section 4.7 shows some developments in percentages of run-times 
that could suggest that a better choice of parameters is possible. More research is needed 
to understand how to choose the fastest set of parameters. 

• Cache computation of p and its powers. 

This is an obvious possibility to save execution time, but it cannot save a great share 
when numbers get longer. Figure 21 shows how the percentage of execution time of “FFT 
setup” diminishes as numbers get longer. This has no potential to lower the crossover 
point. 

• Add support for “sparse integers” in the underlying multiplication. 

DKSS_MUL reduces multiplication of long integers to multiplications in 7^, a polynomial 
ring. When it comes to multiplication of two elements of 7^, they are again converted to 
integers (via Kronecker-Schonhage substitution, see Section 3.2.6) and have to be padded 
with zeros. About half of the words of each factor are zero and a future multiplication 
routine could exploit that. Profiling in Section 4.7 showed that up to 85 % of execution 
time is spent with multiplication of elements of IZ and a rising percentage of that is used 
by the underlying integer multiplication. I optimistically estimate the potential of this 
idea to speed up DKSS_MUL to be almost a factor of 2. 

• Count the number of non-zero coefficients in Kronecker-Schonhage substitution. 

We have to pad the polynomial coefficients for Kronecker-Schonhage substitution (cf. 
Section 3.2.6) with zeros, partly because multiple coefficient products are summed up 
and we must prevent that sum from overflowing. By counting the number of non-zero 
coefficients prior to multiplying them, we could upper bound the number of products. 
I estimate one or two bits of padding per coefficient product could be saved, but since 
coefficients are themselves at least 64 bits long, their product is at least 128 bits, so the 
potential saving can be no more than about 1-2 % and shrinks when numbers and thus 
coefficients get longer. 

• Implement dkss_fft() with less extra memory but matrix transposition instead. 

This is definitely an improvement that should be implemented, because it brings down 
the memory requirements from about 30Ai bits to about 20A^ bits (cf. Section 4.5). Yet, 
from the numbers obtained by profiling, I estimate the potential saving in run-time to be 
only a few percent at best. Furthermore, it seems that efficient matrix transposition by 
itself is non-trivial. 
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• Exploit the build of Proth prime numbers p. 

The modulus of P is a prime number of the form h ■ 2M + 1, where /i is a small positive 
odd integer and M is a power of 2. Maybe modular reductions can be sped up by the 
technique listed in [CP05, p. 457]. This has the potential to save a great part of the cost 
of modular reductions, which showed to cost about 22 % of run-time in profiling. 


If all potential savings listed above could be achieved, this would speed up DKSS_MUL by a factor 
of about 2.5. Not included in this factor is a better parameter selection. But even if that and 
other, yet unthought-of, improvements lead to another speed-up by a factor of 2, DKSS_MUL 
would still be at least 4.8 times slower than SMUL and need about 2.5 times more memory. As 
explained on page 63, even then the crossover point could never be reached. 



Appendix A 


Technicalities 


Tests and benchmarks were run on a machine with an Intel Core i7-3770 processor (Ivy Bridge 
microarchitecture) with 3.40 GHz clock rate. Hyper-threading, enhanced SpeedStep and Turbo 
Boost were disabled to enhance accuracy of timings. The mainboard is an ASUS P8Z77-V with 
32 GB PC-1600 dual channel DDR3 memory. 

The CPU has four cores, of which only one core was used while benchmarking. That is, the 
other cores were not switched off, but no other CPU-intensive process was running, except for 
the operating system itself. To improve cache performance, the process affinity was fixed to 
processor 2, which seems to get less interrupt and DPC load than processor 0. 

The CPU has level 1 caches per core of both 32 KB for data and 32 KB for instructions, unified 
level 2 caches of 256 KB per core and a unified level 3 cache of 8 MB for all cores. Caches 
lines are 64 bytes long and all caches are 8-way set associate, except the level 3 cache, which is 
16-way set associative. 

The operating system used was Windows 7 Ultimate with Service Pack 1 in 64-bit mode. 

For benchmarking, the priority class of the process was set to the highest non-realtime value, 
that is, HIGH_PRIORITY_CLASS. The thread priority was also the highest non-realtime value, 
THREAD_PRIORITY_HIGHEST. Together, that results in a base priority level of 13. 

Timings were taken by use of Windows’ QueryThreadCycleTime() function that counts only 
CPU cycles spent by the thread in question. It queries the CPU’s Time Stamp Counter (TSC) 
and its resolution is extremely good: even though the CPU instruction RDTSC is not serializing 
(so some machine language instructions might be executed out-of-order), the accuracy should 
be of the order of a 100 cycles at worst, most likely better. 

As development environment Microsoft’s Visual Studio 2012, vll.0.61030.00 Update 4 was used 
which includes the C++ compiler vl7.00.61030. Code was compiled with options /Ox (full 
optimization), /Db2 (expand any suitable inline function), /Oi (enable intrinsic functions), /Ot 
(favor fast code) and /GL (whole program optimization). 
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